Hi Rui, thanks for driving this! This would be a very useful addition to the Unaligned Checkpoints.
I have no comments on the proposal, as we already discussed it offline.
Looking forward to it being implemented and released!

Regards,
Roman

On Thu, Sep 11, 2025 at 3:52 PM Gabor Somogyi <[email protected]> wrote:

> Hi Rui,
>
> The proposal describes the problem and plan in a detailed way, +1 on
> addressing this. I've couple of questions:
> - We see that couple of workloads require heavy disk usage already. Are
>   there any numbers what additional spilling would mean when buffers
>   exhausted? Some sort of ratio would be also good.
> - Is it planned to opt for slower memory-only recovery after a declared
>   maximum disk usage exceeded? I can imagine situations where
>   memory and disk filled quickly which will blow things up and stays in an
>   infinite loop (huge state + rescale).
>
> BR,
> G
>
>
> On Thu, Sep 11, 2025 at 12:34 PM Rui Fan <[email protected]> wrote:
>
> > Hey everyone,
> >
> > I would like to start a discussion about FLIP-547: Support checkpoint
> > during recovery [1].
> >
> > Currently, when a Flink job recovers from an unaligned checkpoint, it
> > cannot trigger a new checkpoint until the entire recovery process is
> > complete. For state-heavy or computationally intensive jobs, this recovery
> > phase can be very slow, sometimes lasting for hours.
> >
> > This limitation introduces significant challenges. It can block upstream
> > and downstream systems, and any interruption (like another failure or a
> > rescaling event) during this long recovery period causes the job to lose
> > all progress and revert to the last successful checkpoint. This severely
> > impacts the reliability and operational efficiency of long-running,
> > large-scale jobs.
> >
> > This proposal aims to solve these problems by allowing checkpoints to be
> > taken *during* the recovery phase. This would allow a job to periodically
> > save its restored progress, making the recovery process itself
> > fault-tolerant. Adopting this feature will make Flink more robust, improve
> > reliability for demanding workloads, and strengthen processing guarantees
> > like exactly-once semantics.
> >
> > Looking forward to feedback!
> >
> > [1]
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-547%3A+Support+checkpoint+during+recovery
> >
> > Best,
> > Rui
