Can you please have a look at the JobManager log file and report
which checkpoints are restored? You should see messages from
ZooKeeperCompletedCheckpointStore like:
- Found X checkpoints in ZooKeeper
- Initialized with X. Removing all older checkpoints

You can share the complete JobManager log file as well if you like.

– Ufuk

On Wed, Mar 16, 2016 at 2:50 PM, Simone Robutti
<simone.robu...@radicalbit.io> wrote:
> Hello,
>
> I'm testing the checkpointing functionality with hdfs as a backend.
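>
> For reference, the setup is along these lines (a simplified sketch; the
> class name, checkpoint interval and HDFS path are illustrative):
>
>     import org.apache.flink.runtime.state.filesystem.FsStateBackend;
>     import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
>
>     public class CheckpointedJob {
>         public static void main(String[] args) throws Exception {
>             StreamExecutionEnvironment env =
>                 StreamExecutionEnvironment.getExecutionEnvironment();
>             // Take a checkpoint every 1000 ms; snapshots go to HDFS.
>             env.enableCheckpointing(1000);
>             env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints"));
>             // ... source -> counter -> sink goes here ...
>             env.execute("checkpointing test");
>         }
>     }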
>
> From what I can see, it uses different checkpoint files and resumes the
> computation from different points, not always from the latest available
> one. This is unexpected behaviour to me.
>
> Every second I log, for every worker, a counter that is incremented by 1
> at each step.
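>
> Concretely, the counting step is roughly the following (a simplified
> sketch; the real job differs in details and the names are illustrative).
> The counter lives in Flink's keyed, checkpointed state, so it is
> snapshotted with every checkpoint and should be restored on recovery:
>
>     import org.apache.flink.api.common.functions.RichMapFunction;
>     import org.apache.flink.api.common.state.ValueState;
>     import org.apache.flink.api.common.state.ValueStateDescriptor;
>     import org.apache.flink.configuration.Configuration;
>
>     public class CountingMapper extends RichMapFunction<Long, Long> {
>         private transient ValueState<Long> count;
>
>         @Override
>         public void open(Configuration parameters) {
>             count = getRuntimeContext().getState(
>                 new ValueStateDescriptor<>("count", Long.class));
>         }
>
>         @Override
>         public Long map(Long ignored) throws Exception {
>             Long current = count.value();
>             long next = (current == null) ? 1L : current + 1L;
>             count.update(next);
>             // One incoming element per second -> one log line per second.
>             System.out.println("count = " + next);
>             return next;
>         }
>     }
>
> (It is applied on a keyed stream, e.g.
> source.keyBy(...).map(new CountingMapper()).)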
>
> So, for example, on node-1 the count goes up to 5, then I kill a job
> manager or a task manager and it resumes from 4 or 5, which is fine. The
> next time I kill a job manager the count is at 15 and it resumes at 14 or
> 15. Sometimes, though, after a third kill the job resumes at 4 or 5, as
> if the checkpoint it restored from the second time weren't there.
>
> Once I even saw it jump forward: the first kill was at 10 and it resumed
> at 9; the second kill was at 70 and it resumed at 9; the third kill was
> at 15, but it resumed at 69, as if it had restored the checkpoint taken
> around the second kill.
>
> This is clearly inconsistent.
>
> Also, I can see in the logs that it sometimes uses a checkpoint file
> different from the one used for the previous, consistent resume.
>
> What am I doing wrong? Is it a known bug?
