I've played a bit with the two mentioned scenarios and I agree with you: I also don't expect unmanageable additional disk requirements from this addition. If we see issues later, we still have the option to add some limits.
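To put the earlier ratio question into rough numbers, here is a back-of-envelope sketch. Every figure in it is an illustrative assumption (memory sizes, disk size, and the network option defaults, which can differ between Flink versions), so please substitute your own deployment's configuration:

// Rough back-of-envelope: how much extra disk could channel-state spilling need?
// Every figure below is an illustrative assumption, not a measurement.
public class SpillRatioSketch {
    public static void main(String[] args) {
        double flinkMemoryMiB  = 8 * 1024;   // taskmanager.memory.flink.size, assumed 8 GiB
        double networkFraction = 0.10;       // taskmanager.memory.network.fraction, assumed default
        double networkMinMiB   = 64;         // taskmanager.memory.network.min, assumed default
        double networkMaxMiB   = 1024;       // taskmanager.memory.network.max, assumed 1 GiB
        double localDiskMiB    = 100 * 1024; // free local disk per TaskManager, assumed 100 GiB

        // Network buffer memory is the configured fraction, clamped to [min, max].
        double networkMiB =
                Math.min(Math.max(flinkMemoryMiB * networkFraction, networkMinMiB), networkMaxMiB);

        // Channel state originates from these buffers, so per TaskManager the extra
        // spill is bounded by the network memory size. An extreme scale-down
        // (Scenario 1 below) can concentrate the old job's aggregate channel state
        // on fewer nodes and push this bound up accordingly.
        double ratio = networkMiB / localDiskMiB;

        System.out.printf("network memory: %.0f MiB, worst-case extra spill: %.0f MiB (%.1f%% of local disk)%n",
                networkMiB, networkMiB, 100 * ratio);
    }
}

With these assumptions the worst-case extra spill per TaskManager stays around 1% of the local disk; only an extreme scale-down that funnels the old job's aggregate channel state onto a few nodes would change that picture materially.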
+1 from my side.

BR,
G

On Fri, Sep 12, 2025 at 10:48 AM Rui Fan <[email protected]> wrote:

> Hey Gabor, thanks for your attention and discussion!
>
> > We see that a couple of workloads require heavy disk usage already. Are there any numbers on what the additional spilling would mean when buffers are exhausted? Some sort of ratio would also be good.
>
> My primary assessment is that the volume of "channel state" data being spilled to disk should generally not be excessive. This is because this state originates entirely from in-memory network buffers, and the total available disk capacity is typically far greater than the total size of these memory buffers.
>
> As I see it, there are two main scenarios that could trigger spilling:
>
> Scenario 1: Scaling Down Parallelism
>
> For example, if parallelism is reduced from 100 to 1. The old job (with 100 instances) might have a large amount of state held in its network buffers. The new, scaled-down job (with 1 instance) has significantly less memory allocated for network buffers, which could be insufficient to hold the state during recovery, thus causing a spill.
>
> However, I believe this scenario is unlikely in practice. A large amount of channel state (which is snapshotted by unaligned checkpoints) usually indicates high backpressure, and the correct operational response would be to scale up, not down. Scaling up would provide more network buffer memory, which would prevent, rather than cause, spilling.
>
> Scenario 2: All recovered buffers are restored on the input side
>
> This is a more plausible scenario. Even if the parallelism is unchanged, a task's input buffer pool might need to accommodate both its own recovered input state and the recovered output state from upstream tasks. The combined size of this data could exceed the input pool's capacity and trigger spilling.
>
> > Is it planned to fall back to slower memory-only recovery after a declared maximum disk usage is exceeded? I can imagine situations where memory and disk fill up quickly, which will blow things up and stay in an infinite loop (huge state + rescale).
>
> Regarding your question about a fallback plan for when disk usage exceeds its limit: currently, we do not have such a "slower" memory-only plan in place.
>
> The main reason is consistent with the point above: we believe the risk of filling the disk is manageable, as the disk capacity is generally much larger than the potential volume of data from the in-memory network buffers.
>
> However, I completely agree with your suggestion. Implementing such a safety valve would be a valuable addition for the future. We will monitor for related issues, and if they arise, we'll prioritize this enhancement in the future.
>
> WDYT?
>
> Best,
> Rui
>
> On Thu, Sep 11, 2025 at 11:07 PM Roman Khachatryan <[email protected]> wrote:
>
> > Hi Rui, thanks for driving this!
> >
> > This would be a very useful addition to the Unaligned Checkpoints.
> >
> > I have no comments on the proposal as we already discussed it offline. Looking forward to it being implemented and released!
> >
> > Regards,
> > Roman
> >
> > On Thu, Sep 11, 2025 at 3:52 PM Gabor Somogyi <[email protected]> wrote:
> >
> > > Hi Rui,
> > >
> > > The proposal describes the problem and plan in a detailed way, +1 on addressing this. I have a couple of questions:
> > > - We see that a couple of workloads require heavy disk usage already. Are there any numbers on what the additional spilling would mean when buffers are exhausted? Some sort of ratio would also be good.
> > > - Is it planned to fall back to slower memory-only recovery after a declared maximum disk usage is exceeded? I can imagine situations where memory and disk fill up quickly, which will blow things up and stay in an infinite loop (huge state + rescale).
> > >
> > > BR,
> > > G
> > >
> > > On Thu, Sep 11, 2025 at 12:34 PM Rui Fan <[email protected]> wrote:
> > >
> > > > Hey everyone,
> > > >
> > > > I would like to start a discussion about FLIP-547: Support checkpoint during recovery [1].
> > > >
> > > > Currently, when a Flink job recovers from an unaligned checkpoint, it cannot trigger a new checkpoint until the entire recovery process is complete. For state-heavy or computationally intensive jobs, this recovery phase can be very slow, sometimes lasting for hours.
> > > >
> > > > This limitation introduces significant challenges. It can block upstream and downstream systems, and any interruption (like another failure or a rescaling event) during this long recovery period causes the job to lose all progress and revert to the last successful checkpoint. This severely impacts the reliability and operational efficiency of long-running, large-scale jobs.
> > > >
> > > > This proposal aims to solve these problems by allowing checkpoints to be taken *during* the recovery phase. This would allow a job to periodically save its restored progress, making the recovery process itself fault-tolerant. Adopting this feature will make Flink more robust, improve reliability for demanding workloads, and strengthen processing guarantees like exactly-once semantics.
> > > >
> > > > Looking forward to feedback!
> > > >
> > > > [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-547%3A+Support+checkpoint+during+recovery
> > > >
> > > > Best,
> > > > Rui
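For anyone who wants to try the scale-down scenario described above, below is a minimal sketch of the kind of job that can exercise it. All names and values are illustrative only: run it with high parallelism under backpressure so that unaligned checkpoints capture channel state, then restore the resulting checkpoint with a much lower parallelism.

// Minimal sketch of a job for reproducing the scale-down scenario.
// Parallelism, checkpoint interval, and the sleep duration are illustrative.
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class UnalignedRescaleProbe {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(100);                           // first run: high parallelism
        env.enableCheckpointing(10_000L);                  // checkpoint every 10s
        env.getCheckpointConfig().enableUnalignedCheckpoints();

        env.fromSequence(0, Long.MAX_VALUE)                // effectively unbounded source
                .rebalance()                               // network exchange, so channel state exists
                .map(v -> { Thread.sleep(5); return v; })  // slow stage to build up backpressure
                .returns(Types.LONG)
                .print();

        env.execute("unaligned-checkpoint-rescale-probe");
    }
}

Restoring the checkpoint with, say, parallelism 1 makes it easy to observe whether the recovered channel state still fits into the smaller network buffer pool or has to be spilled.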
