Hi devs,

I would like to start a discussion on FLIP-XXX: Independent Checkpoint Based On 
Pipeline Region.

In high-parallelism streaming jobs, a single Task's checkpoint failure causes 
the entire global Checkpoint to abort, leading to degraded checkpoint success 
rates and wasted compute resources (especially for GPU operators).

We propose Regional Checkpoint: when some Regions fail to checkpoint, the 
framework combines the last successful state of the failed Regions with the 
current state of the healthy Regions to produce a logically complete Completed 
Checkpoint while preserving state consistency. The key changes are:

1. Snapshot Collection: Allow partial Region failures; combine the last 
successful state of the failed Regions with the current state of the healthy 
Regions.

2. State Correction: Add a new checkpointCoordinatorForRegionFallback interface 
that lets OperatorCoordinators produce consistent snapshots against the mixed 
view.

3. Checkpoint Store: Track ref_checkpoint_id in the checkpoint metadata to 
prevent premature cleanup of historical checkpoints that are still referenced.
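
To make changes 1 and 3 concrete, here is a rough Java sketch of how I picture 
them. It is only an illustration: every class, method, and field name below is 
hypothetical and is neither existing Flink API nor taken from the FLIP 
document. It also assumes that ref_checkpoint_id records the checkpoint that 
originally wrote a reused Region's state. State Correction (change 2) is not 
modeled here.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch only: these names are not Flink API and not from the FLIP.
final class RegionalCheckpointSketch {

    /** A logically complete checkpoint assembled from per-Region state. */
    static final class CompletedCheckpoint {
        final long checkpointId;
        /** Region id -> id of the checkpoint that actually wrote that Region's state. */
        final Map<String, Long> regionSource;
        /** Older checkpoints whose state is reused here (assumed meaning of ref_checkpoint_id). */
        final Set<Long> refCheckpointIds;

        CompletedCheckpoint(long checkpointId, Map<String, Long> regionSource, Set<Long> refCheckpointIds) {
            this.checkpointId = checkpointId;
            this.regionSource = regionSource;
            this.refCheckpointIds = refCheckpointIds;
        }
    }

    /** Change 1: healthy Regions use current state, failed Regions fall back to their last good state. */
    static CompletedCheckpoint complete(
            long checkpointId, Set<String> allRegions, Set<String> succeeded, CompletedCheckpoint previous) {
        Map<String, Long> merged = new TreeMap<>();
        Set<Long> refs = new TreeSet<>();
        for (String region : allRegions) {
            if (succeeded.contains(region)) {
                merged.put(region, checkpointId);
                continue;
            }
            Long source = previous == null ? null : previous.regionSource.get(region);
            if (source == null) {
                // Nothing to fall back to, so this checkpoint cannot be completed.
                throw new IllegalStateException("No earlier state for region " + region);
            }
            merged.put(region, source);
            refs.add(source); // points at the original writer, even after repeated failures
        }
        return new CompletedCheckpoint(checkpointId, merged, refs);
    }

    /** Change 3: of the cleanup candidates, keep those still referenced by a retained checkpoint. */
    static Set<Long> discardable(List<CompletedCheckpoint> retained, Set<Long> candidates) {
        Set<Long> result = new TreeSet<>(candidates);
        for (CompletedCheckpoint checkpoint : retained) {
            result.removeAll(checkpoint.refCheckpointIds);
        }
        return result;
    }
}
```

In this sketch, a Region that fails in checkpoints 2 and 3 keeps pointing at 
checkpoint 1, so checkpoint 1 stays out of the discardable set until a later 
checkpoint captures that Region's current state again.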

The detailed design is described in the FLIP document: 
https://docs.google.com/document/d/153r9NjHN9xgFUBdZ8sNX6YjUWTREtDMv5i-JaMdE6NU/edit?usp=sharing

Looking forward to your feedback!

Best regards,

Raorao Xiong
