Hi Raorao,

Thanks for driving this FLIP. It's a valuable step toward making checkpointing more robust. I have a few questions from my side:

1. Should CheckpointListener be extended to carry information about region-level checkpoints? I'm a bit concerned that reusing it as-is under regional checkpointing may lead to subtle misunderstandings on the implementer side, for example assuming that "all tasks of the job have completed the checkpoint".

2. The current discussion focuses on region-level failures. How do you envision handling the case where a specific task simply fails to acknowledge the checkpoint for a long period (e.g. due to a transient network issue)? Should the ongoing checkpoint (such as ckp-100 in your example) also reference the state from the previous successful checkpoint for that task?

Looking forward to your feedback.

Best,
Zihao

On Wed, May 27, 2026 at 16:32, 熊饶饶 <[email protected]> wrote:

> Hi devs,
>
> I would like to start a discussion on FLIP-XXX: Independent Checkpoint
> Based On Pipeline Region.
>
> In high-parallelism streaming jobs, a single Task's checkpoint failure
> causes the entire global Checkpoint to abort, leading to degraded
> checkpoint success rates and wasted compute resources (especially for GPU
> operators).
>
> We propose Regional Checkpoint: when some Regions fail to checkpoint, the
> framework combines the historical state of the failed Regions with the
> current state of the healthy Regions to produce a logically complete
> Completed Checkpoint, while preserving state consistency. The key changes
> are:
>
> 1. Snapshot Collection: Allow partial region failures; combine last
> successful state of failed Regions with current state of normal Regions.
>
> 2. State Correction: New checkpointCoordinatorForRegionFallback interface
> for OperatorCoordinators to produce consistent snapshots against the mixed
> view.
>
> 3. Checkpoint Store: Track ref_checkpoint_id in metadata to prevent
> premature cleanup of referenced historical checkpoints.
>
> The detailed design is described in the FLIP document:
>
> https://docs.google.com/document/d/153r9NjHN9xgFUBdZ8sNX6YjUWTREtDMv5i-JaMdE6NU/edit?usp=sharing
>
> Looking forward to your feedback!
>
> Best regards,
>
> Raorao Xiong
