Hi Gyula!

Thanks for raising this discussion. I agree that this would be an
interesting feature, but I have some doubts about the motivation
and use case. If there are multiple independent subgraphs in the same job,
why not just split them across multiple jobs, so that each job contains
a single subgraph and can fail without disturbing the others?
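For concreteness, here is how I understand the proposed mechanism, as a rough
sketch (all names here are made up for illustration, not actual Flink APIs): a
global checkpoint is assembled per independent subgraph, where healthy
subgraphs take a fresh snapshot and a failing subgraph instead contributes the
metadata of its last successful checkpoint.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not Flink's real checkpoint coordinator:
// each independent subgraph tracks the id of its last successful
// checkpoint; a failing subgraph simply inherits that id instead of
// snapshotting again, so the composed checkpoint stays consistent.
public class SubgraphCheckpointSketch {
    // last successful checkpoint id recorded per subgraph
    static final Map<String, Long> lastSuccessful = new HashMap<>();

    // Attempt checkpoint n: healthy subgraphs record n, failing ones
    // keep whatever they last completed (if anything).
    static Map<String, Long> triggerCheckpoint(long n, Map<String, Boolean> healthy) {
        Map<String, Long> snapshot = new HashMap<>();
        for (Map.Entry<String, Boolean> e : healthy.entrySet()) {
            if (e.getValue()) {
                // subgraph is healthy: take a fresh snapshot at id n
                lastSuccessful.put(e.getKey(), n);
            }
            // failing subgraph: inherit previous metadata, if present
            Long id = lastSuccessful.get(e.getKey());
            if (id != null) {
                snapshot.put(e.getKey(), id);
            }
        }
        return snapshot;
    }

    public static void main(String[] args) {
        Map<String, Boolean> status = new HashMap<>();
        status.put("A", true);
        status.put("B", true);
        // checkpoint 1: both subgraphs snapshot fresh state
        Map<String, Long> s1 = triggerCheckpoint(1, status);
        System.out.println(s1);
        // B starts failing; checkpoint 2: A advances, B keeps id 1
        status.put("B", false);
        Map<String, Long> s2 = triggerCheckpoint(2, status);
        System.out.println(s2);
    }
}
```

If that reading is right, the composed checkpoint mixes snapshot ids of
different ages across subgraphs, which is exactly why it only works when the
subgraphs share no edges.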


Gyula Fóra <gyf...@apache.org> wrote on Mon, Feb 7, 2022 at 05:22:

> Hi all!
>
> At the moment checkpointing only works for healthy jobs with all running
> (or some finished) tasks. This sounds reasonable in most cases but there
> are a few applications where it would make sense to checkpoint failing jobs
> as well.
>
> Due to how the checkpointing mechanism works, subgraphs that have a failing
> task cannot be checkpointed without violating the exactly-once semantics.
> However if the job has multiple independent subgraphs (that are not
> connected to each other), then even if one subgraph is failing, the
> other, fully running subgraphs could still be checkpointed. In these cases the tasks of
> the failing subgraph could simply inherit the last successful checkpoint
> metadata (before they started failing). This logic would produce a
> consistent checkpoint.
>
> The job as a whole could now make stateful progress even if some subgraphs
> are constantly failing. This can be very valuable if for some reason the
> job has a larger number of independent subgraphs that are expected to fail
> every once in a while, or if some subgraphs can have longer downtimes that
> would otherwise cause the whole job to stall.
>
> What do you think?
>
> Cheers,
> Gyula
>
