Hi all! At the moment checkpointing only works for healthy jobs, with all tasks running (or some finished). This sounds reasonable in most cases, but there are a few applications where it would make sense to checkpoint failing jobs as well.
Due to how the checkpointing mechanism works, a subgraph that has a failing task cannot be checkpointed without violating the exactly-once semantics. However, if the job has multiple independent subgraphs (subgraphs not connected to each other), then even if one subgraph is failing, the other, fully running one could still be checkpointed. In these cases the tasks of the failing subgraph could simply inherit the metadata of their last successful checkpoint (taken before they started failing). This logic would still produce a consistent checkpoint.

The job as a whole could then make stateful progress even while some subgraphs are constantly failing. This can be very valuable if the job has a large number of independent subgraphs that are expected to fail every once in a while, or if some subgraphs can have longer downtimes that would currently cause the whole job to stall.

What do you think?

Cheers,
Gyula
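P.S. To make the inheritance logic concrete, here is a rough sketch of how the checkpoint metadata could be assembled. All names here are illustrative, not Flink's actual API: healthy subgraphs contribute a fresh snapshot under the new checkpoint id, while failing subgraphs reuse the entry from their last completed checkpoint.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of partial checkpointing for independent subgraphs.
// None of these types exist in Flink; they only illustrate the proposal.
public class PartialCheckpointSketch {

    // One independent (unconnected) subgraph of the job.
    record Subgraph(String id, boolean allTasksRunning) {}

    // Builds the metadata for checkpoint `newId`: healthy subgraphs are
    // snapshotted now, failing subgraphs inherit their entry from the last
    // completed checkpoint, which keeps the overall checkpoint consistent.
    static Map<String, Long> buildCheckpoint(
            List<Subgraph> subgraphs, Map<String, Long> lastCompleted, long newId) {
        Map<String, Long> result = new HashMap<>();
        for (Subgraph sg : subgraphs) {
            if (sg.allTasksRunning()) {
                result.put(sg.id(), newId);                       // fresh snapshot
            } else if (lastCompleted.containsKey(sg.id())) {
                result.put(sg.id(), lastCompleted.get(sg.id()));  // inherit old metadata
            }
            // A subgraph that is failing and never completed a checkpoint
            // contributes nothing and would restart from scratch on recovery.
        }
        return result;
    }

    public static void main(String[] args) {
        List<Subgraph> graphs = List.of(
                new Subgraph("A", true),   // healthy subgraph
                new Subgraph("B", false)); // failing subgraph
        Map<String, Long> lastCompleted = Map.of("A", 7L, "B", 7L);
        System.out.println(buildCheckpoint(graphs, lastCompleted, 8L));
    }
}
```

Since the subgraphs share no edges, mixing a fresh snapshot for A with B's old metadata never splits an in-flight record across checkpoint boundaries, which is why this stays exactly-once.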