Hi Raorao,

Thanks for driving this FLIP. It's a valuable step toward making
checkpointing more robust. I have a few questions from my side:

1. Should CheckpointListener be extended to carry information about
region-level checkpoints? I'm concerned that reusing it as-is under
regional checkpoints may lead implementers to wrong assumptions, for
example that "all tasks of the job have completed the checkpoint".

2. The current discussion focuses on region-level failures.
How do you envision the case where a specific task simply fails to
acknowledge the checkpoint for a long period (e.g. due to a transient
network issue)?
Should the ongoing checkpoint (such as ckp-100 in your example) also
reference the state from the previous successful checkpoint for that task?
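
To make both questions concrete, here is a rough sketch of what I have in
mind. It is only an illustration, not a proposal for the final API: every
name in it (RegionAwareCheckpointListener, UnitState, Composed, compose,
fallbackUnits) is hypothetical, and the CheckpointListener inside is a
local stand-in so that the snippet compiles without Flink. A "unit" can be
read as a region (question 1) or as a single task (question 2); whether
task-level fallback can stay consistent without falling back the whole
region is exactly what I am asking.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: none of these types exist in Flink today.
final class RegionalCheckpointSketch {

    // Local stand-in for Flink's CheckpointListener so the sketch compiles alone.
    interface CheckpointListener {
        void notifyCheckpointComplete(long checkpointId) throws Exception;
    }

    // Hypothetical extension that also reports which units reused older state.
    interface RegionAwareCheckpointListener extends CheckpointListener {
        void notifyCheckpointComplete(long checkpointId, Set<String> fallbackUnits)
                throws Exception;

        // The old-style call is treated as "nothing fell back" (assumption).
        @Override
        default void notifyCheckpointComplete(long checkpointId) throws Exception {
            notifyCheckpointComplete(checkpointId, Collections.<String>emptySet());
        }
    }

    // A state handle plus the checkpoint that produced it (cf. ref_checkpoint_id).
    static final class UnitState {
        final String handle;
        final long refCheckpointId;

        UnitState(String handle, long refCheckpointId) {
            this.handle = handle;
            this.refCheckpointId = refCheckpointId;
        }
    }

    // Result of composing a checkpoint from fresh and reused state.
    static final class Composed {
        final Map<String, UnitState> states = new HashMap<>();
        final Set<String> fallbackUnits = new TreeSet<>();
    }

    // Units that acknowledged use fresh state; the rest reuse the previous checkpoint.
    static Composed compose(long checkpointId, Set<String> expectedUnits,
                            Map<String, String> acked, Map<String, UnitState> previous) {
        Composed result = new Composed();
        for (String unit : expectedUnits) {
            String fresh = acked.get(unit);
            if (fresh != null) {
                result.states.put(unit, new UnitState(fresh, checkpointId));
            } else if (previous.containsKey(unit)) {
                result.states.put(unit, previous.get(unit));
                result.fallbackUnits.add(unit);
            } else {
                throw new IllegalStateException("No state to fall back to for " + unit);
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        Map<String, UnitState> ckp99 = new HashMap<>();
        ckp99.put("task-1", new UnitState("s1@99", 99L));
        ckp99.put("task-2", new UnitState("s2@99", 99L));

        // task-2 never acknowledges ckp-100.
        Map<String, String> acked = new HashMap<>();
        acked.put("task-1", "s1@100");

        Composed ckp100 = compose(100L,
                new TreeSet<>(Arrays.asList("task-1", "task-2")), acked, ckp99);

        RegionAwareCheckpointListener listener = (id, fallback) ->
                System.out.println("ckp-" + id + " completed, fallback=" + fallback);
        listener.notifyCheckpointComplete(100L, ckp100.fallbackUnits);
    }
}
```

With something like this, an implementer can tell from the callback that
ckp-100 completed while task-2 still points at state from ckp-99, instead
of assuming that every task snapshotted at ckp-100.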

Looking forward to your feedback.

Best,
Zihao

熊饶饶 <[email protected]> wrote on Wed, May 27, 2026 at 16:32:

> Hi devs,
>
> I would like to start a discussion on FLIP-XXX: Independent Checkpoint
> Based On Pipeline Region.
>
> In high-parallelism streaming jobs, a single Task's checkpoint failure
> causes the entire global Checkpoint to abort, leading to degraded
> checkpoint success rates and wasted compute resources (especially for GPU
> operators).
>
> We propose Regional Checkpoint: when some Regions fail to checkpoint, the
> framework combines the historical state of the failed Regions with the
> current state of the healthy Regions to produce a logically complete
> Completed Checkpoint — while preserving state consistency. The key changes
> are:
>
> 1. Snapshot Collection — Allow partial region failures; combine last
> successful state of failed Regions with current state of normal Regions.
>
> 2. State Correction — New checkpointCoordinatorForRegionFallback interface
> for OperatorCoordinators to produce consistent snapshots against the mixed
> view.
>
> 3. Checkpoint Store — Track ref_checkpoint_id in metadata to prevent
> premature cleanup of referenced historical checkpoints.
>
> The detailed design is described in the FLIP document:
>
> https://docs.google.com/document/d/153r9NjHN9xgFUBdZ8sNX6YjUWTREtDMv5i-JaMdE6NU/edit?usp=sharing
>
> Looking forward to your feedback!
>
> Best regards,
>
> Raorao Xiong
