Hi Raorao,

Thanks for this proposal! Regional Checkpoint addresses a pain point for AI workloads and large-scale ETL workloads on Flink. I have several questions:

1. Why this is correct might need a bit more elaboration. Based on my understanding, the key concept is: "a snapshot is valid if it matches some instantaneous state that could actually occur during execution". Since each Region consumes independently at its own rate and OperatorState elements advance independently per Region, merging states from different checkpoints still forms a logically consistent snapshot.

2. I'd echo Zihao's q1 about CheckpointListener. Should failed Regions receive a checkpoint-aborted notification for cp100, and if so, how? Is that semantically correct?

3. Regional Checkpoint's usability depends heavily on cooperation from ecosystem components. Of course we don't want to ship a feature nobody supports, so some commonly used connectors and the custom OperatorCoordinators of related projects might need a brief discussion:
- e.g., Kafka-like unbounded split recovery keeps offsets in operator state (different from the path described in the FLIP), which logically should also be fine.
- Also, while flink-connector-xxx rarely has OperatorCoordinator implementations, some closely related projects like flink-cdc, paimon and fluss do have them. I think it's worth briefly assessing whether they can support this.

Overall, this is a promising direction. Looking forward to the discussion!

On Thu, May 28, 2026 at 4:49 PM zihao chen <[email protected]> wrote:

> Hi Raorao,
>
> Thanks for driving this FLIP. It's a valuable step toward making
> checkpointing more robust. I have a few questions from my side:
>
> 1. Should CheckpointListener be extended to carry information about
> region-level checkpoints? I'm a bit concerned that reusing it as-is under
> region checkpoint may lead to subtle misunderstandings on the implementer
> side, maybe like "all tasks of the job have completed the checkpoint".
>
> 2. The current discussion focuses on region-level failures.
> How do you envision the case where a specific task simply fails to
> acknowledge the checkpoint for a long period (e.g. due to a transient
> network issue)?
> Should the ongoing checkpoint (such as ckp-100 in your example) also
> reference the state from the previous successful checkpoint for that task?
>
> Looking forward to your feedback.
>
> Best,
> Zihao
>
> On Wed, May 27, 2026 at 4:32 PM 熊饶饶 <[email protected]> wrote:
>
> > Hi devs,
> >
> > I would like to start a discussion on FLIP-XXX: Independent Checkpoint
> > Based On Pipeline Region.
> >
> > In high-parallelism streaming jobs, a single Task's checkpoint failure
> > causes the entire global Checkpoint to abort, leading to degraded
> > checkpoint success rates and wasted compute resources (especially for GPU
> > operators).
> >
> > We propose Regional Checkpoint: when some Regions fail to checkpoint, the
> > framework combines the historical state of the failed Regions with the
> > current state of the healthy Regions to produce a logically complete
> > Completed Checkpoint, while preserving state consistency. The key changes
> > are:
> >
> > 1. Snapshot Collection - Allow partial region failures; combine last
> > successful state of failed Regions with current state of normal Regions.
> >
> > 2. State Correction - New checkpointCoordinatorForRegionFallback interface
> > for OperatorCoordinators to produce consistent snapshots against the mixed
> > view.
> >
> > 3. Checkpoint Store - Track ref_checkpoint_id in metadata to prevent
> > premature cleanup of referenced historical checkpoints.
> >
> > The detailed design is described in the FLIP document:
> >
> > https://docs.google.com/document/d/153r9NjHN9xgFUBdZ8sNX6YjUWTREtDMv5i-JaMdE6NU/edit?usp=sharing
> >
> > Looking forward to your feedback!
> >
> > Best regards,
> >
> > Raorao Xiong
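
P.S. To make my point 1 and the ref_checkpoint_id bookkeeping a bit more concrete, here is a minimal sketch of how I picture the merge. It is not Flink API and not taken from the FLIP: the names RegionalCheckpointSketch, RegionState, merge and referencedCheckpoints are hypothetical, and a single source offset stands in for a Region's real state. I am also assuming Regions exchange no data with each other, which is what lets an older state for one Region coexist with a newer state for another.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical sketch (not Flink API): merge per-Region state into one completed checkpoint. */
final class RegionalCheckpointSketch {

    /** A Region's state plus the checkpoint that actually produced it. */
    static final class RegionState {
        final long refCheckpointId; // checkpoint in which this state was written
        final long sourceOffset;    // assumption: stand-in for the Region's real state

        RegionState(long refCheckpointId, long sourceOffset) {
            this.refCheckpointId = refCheckpointId;
            this.sourceOffset = sourceOffset;
        }
    }

    /**
     * Builds the view for checkpointId: Regions that acknowledged contribute fresh state,
     * failed Regions fall back to their entry in the previous completed checkpoint.
     */
    static Map<String, RegionState> merge(
            long checkpointId,
            Set<String> regions,
            Map<String, RegionState> previousCompleted,
            Map<String, Long> acknowledgedOffsets) {
        Map<String, RegionState> merged = new TreeMap<>();
        for (String region : regions) {
            Long fresh = acknowledgedOffsets.get(region);
            RegionState fallback = previousCompleted.get(region);
            if (fresh != null) {
                merged.put(region, new RegionState(checkpointId, fresh));
            } else if (fallback != null) {
                // Keeps the original refCheckpointId, so repeated failures still point
                // at the checkpoint that really holds the state.
                merged.put(region, fallback);
            } else {
                // Nothing to fall back to: the checkpoint cannot be completed.
                throw new IllegalStateException("No state to fall back to for " + region);
            }
        }
        return merged;
    }

    /** Checkpoints that must be retained because the merged view still references them. */
    static Set<Long> referencedCheckpoints(Map<String, RegionState> merged) {
        Set<Long> ids = new TreeSet<>();
        for (RegionState state : merged.values()) {
            ids.add(state.refCheckpointId);
        }
        return ids;
    }

    public static void main(String[] args) {
        Map<String, RegionState> cp99 = new HashMap<>();
        cp99.put("region-A", new RegionState(99L, 1_000L));
        cp99.put("region-B", new RegionState(99L, 2_000L));

        // In cp100 only region-A acknowledges; region-B failed its snapshot.
        Map<String, Long> acked = new HashMap<>();
        acked.put("region-A", 1_500L);

        Map<String, RegionState> cp100 = merge(100L, cp99.keySet(), cp99, acked);
        System.out.println(referencedCheckpoints(cp100)); // prints [99, 100]
    }
}
```

In this toy model cp100 looks like an execution in which region-B simply made no progress since cp99, which is why I think the merged view is a reachable state. It also shows why cp99 cannot be cleaned up while cp100 is retained. What the sketch does not cover is the OperatorCoordinator side, which is exactly where I think the FLIP needs more detail.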
