I didn't resubmit the job. Also, the jobs are submitted one by one with
-m yarn-master, not with a long-running YARN session, so I don't really
know how they could get mixed up.
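
Just to be concrete, the jobs are launched roughly like this, one YARN
application each (a sketch with a placeholder jar name and parallelism;
the per-job flag is written here as -m yarn-cluster):

    # one YARN application per job
    ./bin/flink run -m yarn-cluster -yn 2 ./counter-job.jar

    # as opposed to a long-running session that all jobs would share
    ./bin/yarn-session.sh -n 2
    ./bin/flink run ./counter-job.jar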

I will repeat the test from a clean state, because we saw that killing the
job with "yarn application -kill" leaves the "flink run" process alive, so
that may be the problem. We only noticed it a few minutes ago.
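
Before rerunning I'll check that nothing is left over from the previous
attempts, along these lines (just a sketch of the checks, nothing is
assumed beyond the commands already mentioned):

    # look for orphaned submission clients left behind by the kill
    ps aux | grep "[f]link run"

    # make sure no old application is still registered with YARN
    yarn application -list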

If the problem persists, I will come back with the full logs.
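
To get the complete per-container job manager logs, I'd pull them from
YARN after the run, something like this (the application id is a
placeholder, and I'm assuming YARN log aggregation is enabled):

    # dump all container logs of the application, then filter for the
    # checkpoint store messages mentioned below
    yarn logs -applicationId application_XXXX_YYYY > full.log
    grep "ZooKeeperCompletedCheckpointStore" full.log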

Thanks for now,

Simone

2016-03-16 18:04 GMT+01:00 Ufuk Celebi <u...@apache.org>:

> Hey Simone,
>
> from the logs it looks like multiple jobs have been submitted to the
> cluster, not just one. The different files correspond to different
> jobs recovering. The filtered logs show three jobs running/recovering
> (with IDs 10d8ccae6e87ac56bf763caf4bc4742f,
> 124f29322f9026ac1b35435d5de9f625, 7f280b38065eaa6335f5c3de4fc82547).
>
> Did you manually re-submit the job after killing a job manager?
>
> Regarding the counts, it can happen that they are rolled back to an
> earlier consistent state if the latest checkpoint had not been completed
> yet (including the write to ZooKeeper).
>
> Can you please share the complete job manager logs of your program?
> The most helpful thing will be to have a log for each started job
> manager container. I don't know if that is easily possible.
>
> – Ufuk
>
> On Wed, Mar 16, 2016 at 4:12 PM, Simone Robutti
> <simone.robu...@radicalbit.io> wrote:
> > This is the log filtered to check messages from
> > ZooKeeperCompletedCheckpointStore.
> >
> > https://gist.github.com/chobeat/0222b31b87df3fa46a23
> >
> > It looks like it finds only one checkpoint, but I'm not sure whether
> > the different hashes and IDs of the checkpoints are meaningful or not.
> >
> >
> >
> > 2016-03-16 15:33 GMT+01:00 Ufuk Celebi <u...@apache.org>:
> >>
> >> Can you please have a look into the JobManager log file and report
> >> which checkpoints are restored? You should see messages from
> >> ZooKeeperCompletedCheckpointStore like:
> >> - Found X checkpoints in ZooKeeper
> >> - Initialized with X. Removing all older checkpoints
> >>
> >> You can share the complete job manager log file as well if you like.
> >>
> >> – Ufuk
> >>
> >> On Wed, Mar 16, 2016 at 2:50 PM, Simone Robutti
> >> <simone.robu...@radicalbit.io> wrote:
> >> > Hello,
> >> >
> >> > I'm testing the checkpointing functionality with HDFS as the backend.
> >> >
> >> > From what I can see, it uses different checkpoint files and resumes
> >> > the computation from different points, not from the latest available
> >> > one. This is unexpected behaviour to me.
> >> >
> >> > I log every second, for every worker, a counter that is increased
> >> > by 1 at each step.
> >> >
> >> > So, for example, on node-1 the count goes up to 5, then I kill a job
> >> > manager or a task manager and it resumes from 5 or 4, which is fine.
> >> > The next time I kill a job manager the count is at 15 and it resumes
> >> > at 14 or 15. Sometimes it happens that at a third kill the job
> >> > resumes at 4 or 5, as if the checkpoint it resumed from the second
> >> > time wasn't there.
> >> >
> >> > Once I even saw it jump forward: the first kill was at 10 and it
> >> > resumed at 9, the second kill was at 70 and it resumed at 9, and the
> >> > third kill was at 15 but it resumed at 69, as if it had restored the
> >> > checkpoint from the second kill.
> >> >
> >> > This is clearly inconsistent.
> >> >
> >> > Also, in the logs I can see that sometimes it uses a checkpoint file
> >> > different from the one used in the previous, consistent resume.
> >> >
> >> > What am I doing wrong? Is it a known bug?
> >
> >
>
