Hi Rui, thanks for driving this!

This would be a very useful addition to Unaligned Checkpoints.

I have no comments on the proposal, as we already discussed it offline.
Looking forward to it being implemented and released!

Regards,
Roman


On Thu, Sep 11, 2025 at 3:52 PM Gabor Somogyi <[email protected]>
wrote:

> Hi Rui,
>
> The proposal describes the problem and the plan in detail, +1 on
> addressing this. I have a couple of questions:
> - We see that a couple of workloads already require heavy disk usage. Are
> there any numbers on what additional spilling would mean when buffers are
> exhausted? Some sort of ratio would also be good.
> - Is it planned to opt for slower memory-only recovery once a declared
> maximum disk usage is exceeded? I can imagine situations where memory and
> disk fill up quickly, which blows things up and leaves the job stuck in an
> infinite loop (huge state + rescale).
>
> BR,
> G
>
>
> On Thu, Sep 11, 2025 at 12:34 PM Rui Fan <[email protected]> wrote:
>
> > Hey everyone,
> >
> > I would like to start a discussion about FLIP-547: Support checkpoint
> > during recovery [1].
> >
> > Currently, when a Flink job recovers from an unaligned checkpoint, it
> > cannot trigger a new checkpoint until the entire recovery process is
> > complete. For state-heavy or computationally intensive jobs, this
> > recovery phase can be very slow, sometimes lasting for hours.
> >
> > This limitation introduces significant challenges. It can block upstream
> > and downstream systems, and any interruption (like another failure or a
> > rescaling event) during this long recovery period causes the job to lose
> > all progress and revert to the last successful checkpoint. This severely
> > impacts the reliability and operational efficiency of long-running,
> > large-scale jobs.
> >
> > This proposal aims to solve these problems by allowing checkpoints to be
> > taken *during* the recovery phase. This would allow a job to periodically
> > save its restored progress, making the recovery process itself
> > fault-tolerant. Adopting this feature will make Flink more robust,
> > improve reliability for demanding workloads, and strengthen processing
> > guarantees like exactly-once semantics.
> > Looking forward to feedback!
> >
> > [1]
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-547%3A+Support+checkpoint+during+recovery
> >
> > Best,
> > Rui
> >
>
