Hi Raorao,

Thanks for this proposal! The Regional Checkpoint addresses a pain point
for AI workloads and large-scale ETL workloads on Flink. I have several
questions:

1. The FLIP might need a bit more elaboration on why this is safe. Based on
my understanding, the key concept is: "a snapshot is valid if it matches
some instantaneous state that could actually occur during execution". Since
each Region consumes independently at its own rate and OperatorState
elements advance independently per Region, merging states from different
checkpoints still forms a logically consistent snapshot.
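
To make that argument concrete, here is a minimal sketch in Java. All names
(RegionState, merge, referencedCheckpoints) are hypothetical and chosen for
illustration only; none of them come from the FLIP or from Flink. It assumes
Regions exchange no data, so each Region's state can be taken from a
different checkpoint, and it also shows why the older checkpoint must stay
referenced:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class RegionalCheckpointSketch {

    /** Hypothetical per-region state: the checkpoint that wrote it and the offset it covers. */
    static final class RegionState {
        final long sourceCheckpointId;
        final long offset;

        RegionState(long sourceCheckpointId, long offset) {
            this.sourceCheckpointId = sourceCheckpointId;
            this.offset = offset;
        }

        @Override
        public String toString() {
            return "RegionState{cp=" + sourceCheckpointId + ", offset=" + offset + "}";
        }
    }

    /**
     * Merged view for the current checkpoint: regions that acknowledged contribute
     * fresh state, all other regions keep their state from the last completed one.
     */
    static Map<String, RegionState> merge(
            Map<String, RegionState> lastCompleted, Map<String, RegionState> acknowledged) {
        Map<String, RegionState> merged = new HashMap<>(lastCompleted);
        merged.putAll(acknowledged);
        return merged;
    }

    /** Older checkpoint ids still referenced by the merged view; these must not be cleaned up. */
    static Set<Long> referencedCheckpoints(Map<String, RegionState> merged, long currentId) {
        Set<Long> refs = new TreeSet<>();
        for (RegionState state : merged.values()) {
            if (state.sourceCheckpointId != currentId) {
                refs.add(state.sourceCheckpointId);
            }
        }
        return refs;
    }

    public static void main(String[] args) {
        Map<String, RegionState> cp99 = new HashMap<>();
        cp99.put("region-A", new RegionState(99, 1000));
        cp99.put("region-B", new RegionState(99, 500));

        // Only region-A acknowledges cp100; region-B failed to snapshot.
        Map<String, RegionState> cp100Acks = new HashMap<>();
        cp100Acks.put("region-A", new RegionState(100, 1200));

        Map<String, RegionState> cp100 = merge(cp99, cp100Acks);
        System.out.println(cp100.get("region-A")); // RegionState{cp=100, offset=1200}
        System.out.println(cp100.get("region-B")); // RegionState{cp=99, offset=500}
        System.out.println(referencedCheckpoints(cp100, 100)); // [99]
    }
}
```

The merged view (A at 1200, B at 500) is a state the job could have reached
if region-B had simply been slower, which is why it counts as valid under
the definition above.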

2. I'd echo Zihao's q1 about CheckpointListener. Should failed Regions
receive a checkpoint-aborted notification for cp100, and if so, how? Is
that semantically correct?
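
To illustrate why the notification semantics matter, here is a toy sketch in
Java. The RegionAwareCheckpointListener interface and its
notifyRegionFallback method are purely hypothetical, only loosely modeled on
CheckpointListener; they are not an existing Flink API and not something the
FLIP proposes. The sketch assumes a transactional sink that keeps one
pending transaction per checkpoint:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

class RegionAwareListenerSketch {

    /** Hypothetical callback interface; not part of Flink. */
    interface RegionAwareCheckpointListener {
        /** This task's own snapshot is part of the completed checkpoint. */
        void notifyCheckpointComplete(long checkpointId);

        /**
         * The checkpoint completed, but this task's region fell back to the
         * state it wrote for referencedCheckpointId.
         */
        void notifyRegionFallback(long checkpointId, long referencedCheckpointId);
    }

    /** Toy transactional sink: one pending transaction per checkpoint id. */
    static class ToySink implements RegionAwareCheckpointListener {
        final TreeMap<Long, String> pending = new TreeMap<>();
        final List<String> committed = new ArrayList<>();

        void preCommit(long checkpointId, String txn) {
            pending.put(checkpointId, txn);
        }

        @Override
        public void notifyCheckpointComplete(long checkpointId) {
            commitUpTo(checkpointId);
        }

        @Override
        public void notifyRegionFallback(long checkpointId, long referencedCheckpointId) {
            // Only transactions covered by the referenced (older) snapshot are durable here.
            commitUpTo(referencedCheckpointId);
        }

        private void commitUpTo(long checkpointId) {
            // headMap is a live view, so clearing it removes the entries from pending.
            NavigableMap<Long, String> durable = pending.headMap(checkpointId, true);
            committed.addAll(durable.values());
            durable.clear();
        }
    }

    public static void main(String[] args) {
        ToySink sink = new ToySink();
        sink.preCommit(99, "txn-99");
        sink.notifyCheckpointComplete(99);
        sink.preCommit(100, "txn-100");
        // The region failed cp100, so the merged checkpoint references its cp99 state.
        sink.notifyRegionFallback(100, 99);
        System.out.println(sink.committed); // [txn-99]
        System.out.println(sink.pending);   // {100=txn-100}
    }
}
```

If the failed Region instead received a plain notifyCheckpointComplete(100),
this toy sink would commit txn-100 even though its state is not part of the
checkpoint, which is the kind of subtle misunderstanding Zihao mentioned.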

3. The usefulness of Regional Checkpoint depends heavily on cooperation
from ecosystem components, and we don't want to ship a feature nobody
supports. So commonly used connectors and the custom OperatorCoordinators
in related projects might need a brief discussion:

- e.g., Kafka-like sources recover unbounded splits from offsets kept in
operator state (a different path from the one the FLIP describes), which
logically should also be fine.

- Also, while flink-connector-xxx rarely has OperatorCoordinator
implementations, some closely related projects like flink-cdc, paimon and
fluss do have them. I think it's worth briefly assessing whether they can
support this.

Overall, this is a promising direction. Looking forward to the discussion!

On Thu, May 28, 2026 at 4:49 PM zihao chen <[email protected]> wrote:

> Hi Raorao,
>
> Thanks for driving this FLIP. It's a valuable step toward making
> checkpointing more robust. I have a few questions from my side:
>
> 1. Should CheckpointListener be extended to carry information about
> region-level checkpoints? I'm a bit concerned that reusing it as-is under
> region checkpoint may lead to subtle misunderstandings on the implementer
> side, maybe like "all tasks of the job have completed the checkpoint".
>
> 2. The current discussion focuses on region-level failures.
> How do you envision the case where a specific task simply fails to
> acknowledge the checkpoint for a long period (e.g. due to a transient
> network issue)?
> Should the ongoing checkpoint (such as ckp-100 in your example) also
> reference the state from the previous successful checkpoint for that task?
>
> Looking forward to your feedback.
>
> Best,
> Zihao
>
> On Wed, May 27, 2026 at 4:32 PM 熊饶饶 <[email protected]> wrote:
>
> > Hi devs,
> >
> > I would like to start a discussion on FLIP-XXX: Independent Checkpoint
> > Based On Pipeline Region.
> >
> > In high-parallelism streaming jobs, a single Task's checkpoint failure
> > causes the entire global Checkpoint to abort, leading to degraded
> > checkpoint success rates and wasted compute resources (especially for GPU
> > operators).
> >
> > We propose Regional Checkpoint: when some Regions fail to checkpoint, the
> > framework combines the historical state of the failed Regions with the
> > current state of the healthy Regions to produce a logically complete
> > Completed Checkpoint — while preserving state consistency. The key
> > changes are:
> >
> > 1. Snapshot Collection — Allow partial region failures; combine last
> > successful state of failed Regions with current state of normal Regions.
> >
> > 2. State Correction — New checkpointCoordinatorForRegionFallback
> > interface for OperatorCoordinators to produce consistent snapshots
> > against the mixed view.
> >
> > 3. Checkpoint Store — Track ref_checkpoint_id in metadata to prevent
> > premature cleanup of referenced historical checkpoints.
> >
> > The detailed design is described in the FLIP document:
> >
> >
> > https://docs.google.com/document/d/153r9NjHN9xgFUBdZ8sNX6YjUWTREtDMv5i-JaMdE6NU/edit?usp=sharing
> >
> > Looking forward to your feedback!
> >
> > Best regards,
> >
> > Raorao Xiong
>
