Hello, I'm testing the checkpointing functionality with hdfs as a backend.
For what I can see it uses different checkpointing files and resume the computation from different points and not from the latest available. This is to me an unexpected behaviour. I log every second, for every worker, a counter that is increased by 1 at each step. So for example on node-1 the count goes up to 5, then I kill a job manager or task manager and it resumes from 5 or 4 and it's ok. The next time I kill a job manager the count is at 15 and it resumes at 14 or 15. Sometimes it may happen that at a third kill the work resumes at 4 or 5 as if the checkpoint resumed the second time wasn't there. Once I even saw it jump forward: the first kill is at 10 and it resumes at 9, the second kill is at 70 and it resumes at 9, the third kill is at 15 but it resumes at 69 as if it resumed from the second kill checkpoint. This is clearly inconsistent. Also, in the logs I can find that sometimes it uses a checkpoint file different from the previous, consistent resume. What am I doing wrong? Is it a known bug?