Hello,

I'm testing the checkpointing functionality with hdfs as a backend.

From what I can see, it uses different checkpoint files and resumes the
computation from different points, not from the latest available one.
This is unexpected behaviour to me.

Every second, for every worker, I log a counter that is incremented by 1
at each step.
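To make the semantics I expect concrete, here is a minimal, hypothetical simulation (plain Python, not actual Flink code): a worker increments a counter, each checkpoint is written to its own file (similar to Flink's chk-<n> directories), and after a crash the restore should always pick the checkpoint with the highest id, never an older one.

```python
import os
import tempfile

class Worker:
    def __init__(self, ckpt_dir):
        self.ckpt_dir = ckpt_dir
        self.count = 0

    def step(self):
        # counter incremented by 1 at each step, as in my job
        self.count += 1

    def checkpoint(self, ckpt_id):
        # one file per checkpoint, like Flink's chk-<n> directories
        path = os.path.join(self.ckpt_dir, "chk-%d" % ckpt_id)
        with open(path, "w") as f:
            f.write(str(self.count))

    def restore_latest(self):
        # expected semantics: resume from the checkpoint with the
        # highest id, never from an earlier one
        ckpts = sorted(os.listdir(self.ckpt_dir),
                       key=lambda name: int(name.split("-")[1]))
        with open(os.path.join(self.ckpt_dir, ckpts[-1])) as f:
            self.count = int(f.read())

with tempfile.TemporaryDirectory() as d:
    w = Worker(d)
    for ckpt_id in range(1, 4):   # three checkpoint intervals
        for _ in range(5):
            w.step()
        w.checkpoint(ckpt_id)
    # simulate a crash and a restart from the latest checkpoint
    restarted = Worker(d)
    restarted.restore_latest()
    print(restarted.count)  # 15, not 5 or 10
```

What I observe instead is behaviour as if restore_latest sometimes picked an older chk-<n> than the most recent one.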

So for example, on node-1 the count goes up to 5, then I kill a job
manager or task manager and it resumes from 4 or 5, which is fine. The
next time I kill a job manager the count is at 15 and it resumes at 14
or 15. But sometimes, after a third kill, the job resumes at 4 or 5, as
if the checkpoint it resumed from the second time were not there.

Once I even saw it jump forward: the first kill was at 10 and it resumed
at 9; the second kill was at 70 and it resumed at 9; the third kill was
at 15, but it resumed at 69, as if it had restored the checkpoint taken
before the second kill.

This is clearly inconsistent.

Also, in the logs I can see that it sometimes uses a checkpoint file
different from the one used for the previous, consistent resume.

What am I doing wrong? Is it a known bug?
