Hey Gabor, thanks for your attention and discussion!

> We see that couple of workloads require heavy disk usage already. Are
> there any numbers what additional spilling would mean when buffers
> exhausted?
> Some sort of ratio would be also good.
My primary assessment is that the volume of "channel state" data spilled to
disk should generally not be excessive. This state originates entirely from
in-memory network buffers, and the total available disk capacity is typically
far greater than the total size of those buffers. As I see it, there are two
main scenarios that could trigger spilling:

Scenario 1: Scaling Down Parallelism

For example, parallelism is reduced from 100 to 1. The old job (with 100
instances) might hold a large amount of state in its network buffers, while
the new, scaled-down job (with 1 instance) has significantly less memory
allocated for network buffers, which could be insufficient to hold that state
during recovery, thus causing a spill. However, I believe this scenario is
unlikely in practice. A large amount of channel state (which is snapshotted
by unaligned checkpoints) usually indicates high backpressure, and the
correct operational response would be to scale up, not down. Scaling up would
provide more network buffer memory, which would prevent, rather than cause,
spilling.

Scenario 2: All recovered buffers are restored on the input side

This is the more plausible scenario. Even if the parallelism is unchanged, a
task's input buffer pool might need to accommodate both its own recovered
input state and the recovered output state from upstream tasks. The combined
size of this data could exceed the input pool's capacity and trigger spilling.
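To make the ratio a bit more concrete, here is a rough back-of-envelope
sketch (illustrative only, not part of the FLIP; the memory size, disk size
and config defaults below are just assumptions, please substitute your own
taskmanager.memory.network.* settings). The idea is simply that the data that
could ever be spilled per TaskManager is bounded by its configured network
memory, which is usually orders of magnitude smaller than the local disk:

import static java.lang.Math.max;
import static java.lang.Math.min;

public class ChannelStateSpillEstimate {
    public static void main(String[] args) {
        // Assumed setup: 4 GB of total Flink memory per TM, 100 GB local disk.
        long totalFlinkMemoryBytes = 4L << 30;
        long localDiskBytes = 100L << 30;

        // Assumed network memory settings (check your own configuration):
        // taskmanager.memory.network.fraction / .min / .max
        double networkFraction = 0.1;
        long networkMin = 64L << 20;   // 64 MB
        long networkMax = 1L << 30;    // 1 GB

        // Network memory = fraction of total Flink memory, clamped to [min, max].
        long networkMemory = min(networkMax,
                max(networkMin, (long) (totalFlinkMemoryBytes * networkFraction)));

        // The channel state snapshotted by unaligned checkpoints lives in these
        // network buffers, so the worst-case spill per TM is bounded by them.
        double ratio = (double) networkMemory / localDiskBytes;
        System.out.printf("worst-case spill per TM: %d MB, i.e. %.2f%% of disk%n",
                networkMemory >> 20, ratio * 100);
    }
}

With these assumed numbers the worst case is roughly 400 MB per TaskManager,
well below 1% of the local disk, which is why I consider the risk of filling
the disk manageable.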
> Is it planned to opt for slower memory-only recovery after a declared
> maximum disk usage exceeded? I can imagine situations where
> memory and disk filled quickly which will blow things up and stays in an
> infinite loop (huge state + rescale).

Regarding your question about a fallback plan for when disk usage exceeds its
limit: currently, we do not have such a "slower" memory-only plan in place.
The main reason is consistent with the point above: we believe the risk of
filling the disk is manageable, as the disk capacity is generally much larger
than the potential volume of data from the in-memory network buffers.

However, I completely agree with your suggestion. Such a safety valve would
be a valuable addition, so we will monitor for related issues and, if they
arise, prioritize this enhancement in the future.

WDYT?

Best,
Rui

On Thu, Sep 11, 2025 at 11:07 PM Roman Khachatryan <[email protected]> wrote:

> Hi Rui, thanks for driving this!
>
> This would be a very useful addition to the Unaligned Checkpoints.
>
> I have no comments on the proposal as we already discussed it offline,
> Looking forward to it being implemented and released!
>
> Regards,
> Roman
>
>
> On Thu, Sep 11, 2025 at 3:52 PM Gabor Somogyi <[email protected]> wrote:
>
> > Hi Rui,
> >
> > The proposal describes the problem and plan in a detailed way, +1 on
> > addressing this. I've couple of questions:
> > - We see that couple of workloads require heavy disk usage already. Are
> > there any numbers what additional spilling would mean when buffers
> > exhausted?
> > Some sort of ratio would be also good.
> > - Is it planned to opt for slower memory-only recovery after a declared
> > maximum disk usage exceeded? I can imagine situations where
> > memory and disk filled quickly which will blow things up and stays in an
> > infinite loop (huge state + rescale).
> >
> > BR,
> > G
> >
> >
> > On Thu, Sep 11, 2025 at 12:34 PM Rui Fan <[email protected]> wrote:
> >
> > > Hey everyone,
> > >
> > > I would like to start a discussion about FLIP-547: Support checkpoint
> > > during recovery [1].
> > >
> > > Currently, when a Flink job recovers from an unaligned checkpoint, it
> > > cannot trigger a new checkpoint until the entire recovery process is
> > > complete. For state-heavy or computationally intensive jobs, this
> > > recovery phase can be very slow, sometimes lasting for hours.
> > >
> > > This limitation introduces significant challenges. It can block
> > > upstream and downstream systems, and any interruption (like another
> > > failure or a rescaling event) during this long recovery period causes
> > > the job to lose all progress and revert to the last successful
> > > checkpoint. This severely impacts the reliability and operational
> > > efficiency of long-running, large-scale jobs.
> > >
> > > This proposal aims to solve these problems by allowing checkpoints to
> > > be taken *during* the recovery phase. This would allow a job to
> > > periodically save its restored progress, making the recovery process
> > > itself fault-tolerant. Adopting this feature will make Flink more
> > > robust, improve reliability for demanding workloads, and strengthen
> > > processing guarantees like exactly-once semantics.
> > > Looking forward to feedback!
> > >
> > > [1]
> > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-547%3A+Support+checkpoint+during+recovery
> > >
> > > Best,
> > > Rui
