Hi all! At the moment checkpointing only works for healthy jobs, with all tasks running (or some finished). This sounds reasonable in most cases, but there are a few applications where it would make sense to checkpoint failing jobs as well.
Due to how the checkpointing mechanism works, a subgraph that has a failing task cannot be checkpointed without violating the exactly-once semantics. However, if the job has multiple independent subgraphs (subgraphs not connected to each other), then even if one subgraph is failing, the other, fully running one could still be checkpointed. In these cases the tasks of the failing subgraph could simply inherit the metadata of their last successful checkpoint (taken before they started failing). This logic would still produce a consistent checkpoint.

The job as a whole could then make stateful progress even while some subgraphs are constantly failing. This can be very valuable if the job has a large number of independent subgraphs that are expected to fail every once in a while, or if some subgraphs can have longer downtimes that would currently cause the whole job to stall.

What do you think?

Cheers,
Gyula
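P.S. To make the inheritance logic concrete, here is a rough sketch of how the checkpoint metadata could be assembled. All names here are illustrative, not Flink's actual API: healthy subgraphs contribute a fresh snapshot under the new checkpoint id, while failing subgraphs reuse the entry from their last completed checkpoint.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of partial checkpointing for independent subgraphs.
// None of these types exist in Flink; they only illustrate the proposal.
public class PartialCheckpointSketch {

    // One independent (unconnected) subgraph of the job.
    record Subgraph(String id, boolean allTasksRunning) {}

    // Builds the metadata for checkpoint `newId`: healthy subgraphs are
    // snapshotted now, failing subgraphs inherit their entry from the last
    // completed checkpoint, which keeps the overall checkpoint consistent.
    static Map<String, Long> buildCheckpoint(
            List<Subgraph> subgraphs, Map<String, Long> lastCompleted, long newId) {
        Map<String, Long> result = new HashMap<>();
        for (Subgraph sg : subgraphs) {
            if (sg.allTasksRunning()) {
                result.put(sg.id(), newId);                       // fresh snapshot
            } else if (lastCompleted.containsKey(sg.id())) {
                result.put(sg.id(), lastCompleted.get(sg.id()));  // inherit old metadata
            }
            // A subgraph that is failing and never completed a checkpoint
            // contributes nothing and would restart from scratch on recovery.
        }
        return result;
    }

    public static void main(String[] args) {
        List<Subgraph> graphs = List.of(
                new Subgraph("A", true),   // healthy subgraph
                new Subgraph("B", false)); // failing subgraph
        Map<String, Long> lastCompleted = Map.of("A", 7L, "B", 7L);
        System.out.println(buildCheckpoint(graphs, lastCompleted, 8L));
    }
}
```

Since the subgraphs share no edges, mixing a fresh snapshot for A with B's old metadata never splits an in-flight record across checkpoint boundaries, which is why this stays exactly-once.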