Hi devs,

I would like to start a discussion on FLIP-XXX: Independent Checkpoint Based On Pipeline Region.

In high-parallelism streaming jobs, a checkpoint failure in a single Task aborts the entire global checkpoint. This lowers checkpoint success rates and wastes compute resources (especially for GPU operators).

We propose Regional Checkpoint: when some Regions fail to checkpoint, the framework combines the historical state of the failed Regions with the current state of the healthy Regions to produce a logically complete Completed Checkpoint, while preserving state consistency.

The key changes are:

1. Snapshot Collection: allow partial Region failures, and combine the last successful state of the failed Regions with the current state of the healthy Regions.
2. State Correction: add a new checkpointCoordinatorForRegionFallback interface so that OperatorCoordinators can produce snapshots that are consistent with the mixed view.
3. Checkpoint Store: track ref_checkpoint_id in the checkpoint metadata to prevent premature cleanup of historical checkpoints that are still referenced.

The detailed design is described in the FLIP document:
https://docs.google.com/document/d/153r9NjHN9xgFUBdZ8sNX6YjUWTREtDMv5i-JaMdE6NU/edit?usp=sharing

Looking forward to your feedback!

Best regards,
Raorao Xiong
