I didn't resubmit the job. Also, the jobs are submitted one by one with -m yarn-master, not with a long-running YARN session, so I don't really see how they could mix up.
I will repeat the test with a cleaned state, because we saw that killing the job with "yarn application -kill" left the "flink run" process alive, so that may be the problem. We only noticed it a few minutes ago. If the problem persists, I will come back with a full log.

Thanks for now,
Simone

2016-03-16 18:04 GMT+01:00 Ufuk Celebi <u...@apache.org>:
> Hey Simone,
>
> from the logs it looks like multiple jobs have been submitted to the
> cluster, not just one. The different files correspond to different jobs
> recovering. The filtered logs show three jobs running/recovering (with IDs
> 10d8ccae6e87ac56bf763caf4bc4742f, 124f29322f9026ac1b35435d5de9f625,
> 7f280b38065eaa6335f5c3de4fc82547).
>
> Did you manually re-submit the job after killing a job manager?
>
> Regarding the counts, it can happen that they are rolled back to a
> previous consistent state if the checkpoint was not completed yet
> (including the write to ZooKeeper). In that case the job state will be
> rolled back to an earlier consistent state.
>
> Can you please share the complete job manager logs of your program?
> The most helpful thing would be to have a log for each started job
> manager container. I don't know if that is easily possible.
>
> – Ufuk
>
> On Wed, Mar 16, 2016 at 4:12 PM, Simone Robutti
> <simone.robu...@radicalbit.io> wrote:
> > This is the log filtered to check messages from
> > ZooKeeperCompletedCheckpointStore:
> >
> > https://gist.github.com/chobeat/0222b31b87df3fa46a23
> >
> > It looks like it finds only one checkpoint, but I'm not sure whether
> > the different hashes and IDs of the checkpoints are meaningful or not.
> >
> >
> > 2016-03-16 15:33 GMT+01:00 Ufuk Celebi <u...@apache.org>:
> >>
> >> Can you please have a look into the JobManager log file and report
> >> which checkpoints are restored? You should see messages from
> >> ZooKeeperCompletedCheckpointStore like:
> >> - Found X checkpoints in ZooKeeper
> >> - Initialized with X. Removing all older checkpoints
> >>
> >> You can share the complete job manager log file as well if you like.
> >>
> >> – Ufuk
> >>
> >> On Wed, Mar 16, 2016 at 2:50 PM, Simone Robutti
> >> <simone.robu...@radicalbit.io> wrote:
> >> > Hello,
> >> >
> >> > I'm testing the checkpointing functionality with HDFS as a backend.
> >> >
> >> > From what I can see, it uses different checkpoint files and resumes
> >> > the computation from different points, not from the latest available
> >> > one. This is unexpected behaviour to me.
> >> >
> >> > Every second, for every worker, I log a counter that is increased
> >> > by 1 at each step.
> >> >
> >> > So, for example, on node-1 the count goes up to 5, then I kill a job
> >> > manager or task manager and it resumes from 5 or 4, which is fine.
> >> > The next time I kill a job manager the count is at 15 and it resumes
> >> > at 14 or 15. Sometimes it happens that at a third kill the work
> >> > resumes at 4 or 5, as if the checkpoint restored the second time
> >> > wasn't there.
> >> >
> >> > Once I even saw it jump forward: the first kill was at 10 and it
> >> > resumed at 9, the second kill was at 70 and it resumed at 9, the
> >> > third kill was at 15 but it resumed at 69, as if it restored from
> >> > the second kill's checkpoint.
> >> >
> >> > This is clearly inconsistent.
> >> >
> >> > Also, in the logs I can see that sometimes it uses a checkpoint file
> >> > different from the one of the previous, consistent resume.
> >> >
> >> > What am I doing wrong? Is it a known bug?
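For reference, a minimal Scala sketch of the kind of test job described in the thread could look like the code below. It is only an illustration, not the actual program under discussion: the class and job names, the 1-second checkpoint interval and the use of the old Checkpointed source API are assumptions, and the state backend (e.g. an HDFS checkpoint path) is assumed to be configured in flink-conf.yaml.

import org.apache.flink.streaming.api.checkpoint.Checkpointed
import org.apache.flink.streaming.api.functions.source.SourceFunction
import org.apache.flink.streaming.api.scala._

// Source that emits an ever-increasing counter once per second and keeps it
// in checkpointed state, so after a failure it should resume near the last
// emitted value instead of jumping back or forward. (Sketch only; names and
// checkpoint interval are assumptions, not taken from the thread.)
class CounterSource extends SourceFunction[Long] with Checkpointed[java.lang.Long] {

  @volatile private var running = true
  private var count: Long = 0L

  override def run(ctx: SourceFunction.SourceContext[Long]): Unit = {
    while (running) {
      // Emit under the checkpoint lock so the counter stays consistent
      // with the snapshot taken by a checkpoint.
      ctx.getCheckpointLock.synchronized {
        count += 1
        ctx.collect(count)
      }
      Thread.sleep(1000)
    }
  }

  override def cancel(): Unit = { running = false }

  override def snapshotState(checkpointId: Long, timestamp: Long): java.lang.Long = count

  override def restoreState(state: java.lang.Long): Unit = { count = state }
}

object CheckpointedCounterJob {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // Take a checkpoint every second; the state backend (e.g. an HDFS path)
    // is assumed to be configured in flink-conf.yaml.
    env.enableCheckpointing(1000)

    env.addSource(new CounterSource)
      .map(c => s"count = $c")
      .print()

    env.execute("checkpointed counter test")
  }
}

With something like this running, killing the JobManager while watching the printed counts makes it directly visible in the output which checkpoint the job was restored from.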