Hi Rui,

The proposal describes the problem and the plan in detail, +1 on addressing this.

I have a couple of questions:
- We already see that a couple of workloads require heavy disk usage. Are there any numbers on how much additional spilling to expect when the buffers are exhausted? Some sort of ratio would also be good.
- Is it planned to fall back to the slower memory-only recovery once a declared maximum disk usage is exceeded? I can imagine situations where memory and disk fill up quickly, which blows things up and leaves the job stuck in an infinite loop (huge state + rescale).

BR,
G

On Thu, Sep 11, 2025 at 12:34 PM Rui Fan <[email protected]> wrote:

> Hey everyone,
>
> I would like to start a discussion about FLIP-547: Support checkpoint
> during recovery [1].
>
> Currently, when a Flink job recovers from an unaligned checkpoint, it
> cannot trigger a new checkpoint until the entire recovery process is
> complete. For state-heavy or computationally intensive jobs, this recovery
> phase can be very slow, sometimes lasting for hours.
>
> This limitation introduces significant challenges. It can block upstream
> and downstream systems, and any interruption (like another failure or a
> rescaling event) during this long recovery period causes the job to lose
> all progress and revert to the last successful checkpoint. This severely
> impacts the reliability and operational efficiency of long-running,
> large-scale jobs.
>
> This proposal aims to solve these problems by allowing checkpoints to be
> taken *during* the recovery phase. This would allow a job to periodically
> save its restored progress, making the recovery process itself
> fault-tolerant. Adopting this feature will make Flink more robust, improve
> reliability for demanding workloads, and strengthen processing guarantees
> like exactly-once semantics.
>
> Looking forward to feedback!
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-547%3A+Support+checkpoint+during+recovery
>
> Best,
> Rui
