I've played a bit with the two mentioned scenarios and I agree with you: I also don't expect unmanageable additional disk requirements from this addition. If we see issues later, we still have the option to add some limits.
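To put the earlier ratio question into rough numbers, here is a back-of-envelope sketch. Every figure in it is an illustrative assumption (memory sizes, disk size, and the network option defaults, which can differ between Flink versions), so please substitute your own deployment's configuration:

// Rough back-of-envelope: how much extra disk could channel-state spilling need?
// Every figure below is an illustrative assumption, not a measurement.
public class SpillRatioSketch {
    public static void main(String[] args) {
        double flinkMemoryMiB  = 8 * 1024;   // taskmanager.memory.flink.size, assumed 8 GiB
        double networkFraction = 0.10;       // taskmanager.memory.network.fraction, assumed default
        double networkMinMiB   = 64;         // taskmanager.memory.network.min, assumed default
        double networkMaxMiB   = 1024;       // taskmanager.memory.network.max, assumed 1 GiB
        double localDiskMiB    = 100 * 1024; // free local disk per TaskManager, assumed 100 GiB

        // Network buffer memory is the configured fraction, clamped to [min, max].
        double networkMiB =
                Math.min(Math.max(flinkMemoryMiB * networkFraction, networkMinMiB), networkMaxMiB);

        // Channel state originates from these buffers, so per TaskManager the extra
        // spill is bounded by the network memory size. An extreme scale-down
        // (Scenario 1 below) can concentrate the old job's aggregate channel state
        // on fewer nodes and push this bound up accordingly.
        double ratio = networkMiB / localDiskMiB;

        System.out.printf("network memory: %.0f MiB, worst-case extra spill: %.0f MiB (%.1f%% of local disk)%n",
                networkMiB, networkMiB, 100 * ratio);
    }
}

With these assumptions the worst-case extra spill per TaskManager stays around 1% of the local disk; only an extreme scale-down that funnels the old job's aggregate channel state onto a few nodes would change that picture materially.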
+1 from my side.

BR,
G

On Fri, Sep 12, 2025 at 10:48 AM Rui Fan <[email protected]> wrote:

> Hey Gabor, thanks for your attention and discussion!
>
> > We see that a couple of workloads require heavy disk usage already. Are there any numbers on what the additional spilling would mean when buffers are exhausted? Some sort of ratio would also be good.
>
> My primary assessment is that the volume of "channel state" data being spilled to disk should generally not be excessive. This is because this state originates entirely from in-memory network buffers, and the total available disk capacity is typically far greater than the total size of these memory buffers.
>
> As I see it, there are two main scenarios that could trigger spilling:
>
> Scenario 1: Scaling Down Parallelism
>
> For example, if parallelism is reduced from 100 to 1. The old job (with 100 instances) might have a large amount of state held in its network buffers. The new, scaled-down job (with 1 instance) has significantly less memory allocated for network buffers, which could be insufficient to hold the state during recovery, thus causing a spill.
>
> However, I believe this scenario is unlikely in practice. A large amount of channel state (which is snapshotted by unaligned checkpoints) usually indicates high backpressure, and the correct operational response would be to scale up, not down. Scaling up would provide more network buffer memory, which would prevent, rather than cause, spilling.
>
> Scenario 2: All recovered buffers are restored on the input side
>
> This is a more plausible scenario. Even if the parallelism is unchanged, a task's input buffer pool might need to accommodate both its own recovered input state and the recovered output state from upstream tasks. The combined size of this data could exceed the input pool's capacity and trigger spilling.
>
> > Is it planned to fall back to slower memory-only recovery after a declared maximum disk usage is exceeded? I can imagine situations where memory and disk fill up quickly, which will blow things up and stay in an infinite loop (huge state + rescale).
>
> Regarding your question about a fallback plan for when disk usage exceeds its limit: currently, we do not have such a "slower" memory-only plan in place.
>
> The main reason is consistent with the point above: we believe the risk of filling the disk is manageable, as the disk capacity is generally much larger than the potential volume of data from the in-memory network buffers.
>
> However, I completely agree with your suggestion. Implementing such a safety valve would be a valuable addition for the future. We will monitor for related issues, and if they arise, we'll prioritize this enhancement in the future.
>
> WDYT?
>
> Best,
> Rui
>
> On Thu, Sep 11, 2025 at 11:07 PM Roman Khachatryan <[email protected]> wrote:
>
> > Hi Rui, thanks for driving this!
> >
> > This would be a very useful addition to the Unaligned Checkpoints.
> >
> > I have no comments on the proposal as we already discussed it offline. Looking forward to it being implemented and released!
> >
> > Regards,
> > Roman
> >
> > On Thu, Sep 11, 2025 at 3:52 PM Gabor Somogyi <[email protected]> wrote:
> >
> > > Hi Rui,
> > >
> > > The proposal describes the problem and plan in a detailed way, +1 on addressing this. I have a couple of questions:
> > > - We see that a couple of workloads require heavy disk usage already. Are there any numbers on what the additional spilling would mean when buffers are exhausted? Some sort of ratio would also be good.
> > > - Is it planned to fall back to slower memory-only recovery after a declared maximum disk usage is exceeded? I can imagine situations where memory and disk fill up quickly, which will blow things up and stay in an infinite loop (huge state + rescale).
> > >
> > > BR,
> > > G
> > >
> > > On Thu, Sep 11, 2025 at 12:34 PM Rui Fan <[email protected]> wrote:
> > >
> > > > Hey everyone,
> > > >
> > > > I would like to start a discussion about FLIP-547: Support checkpoint during recovery [1].
> > > >
> > > > Currently, when a Flink job recovers from an unaligned checkpoint, it cannot trigger a new checkpoint until the entire recovery process is complete. For state-heavy or computationally intensive jobs, this recovery phase can be very slow, sometimes lasting for hours.
> > > >
> > > > This limitation introduces significant challenges. It can block upstream and downstream systems, and any interruption (like another failure or a rescaling event) during this long recovery period causes the job to lose all progress and revert to the last successful checkpoint. This severely impacts the reliability and operational efficiency of long-running, large-scale jobs.
> > > >
> > > > This proposal aims to solve these problems by allowing checkpoints to be taken *during* the recovery phase. This would allow a job to periodically save its restored progress, making the recovery process itself fault-tolerant. Adopting this feature will make Flink more robust, improve reliability for demanding workloads, and strengthen processing guarantees like exactly-once semantics.
> > > >
> > > > Looking forward to feedback!
> > > >
> > > > [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-547%3A+Support+checkpoint+during+recovery
> > > >
> > > > Best,
> > > > Rui
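For anyone who wants to try the scale-down scenario described above, below is a minimal sketch of the kind of job that can exercise it. All names and values are illustrative only: run it with high parallelism under backpressure so that unaligned checkpoints capture channel state, then restore the resulting checkpoint with a much lower parallelism.

// Minimal sketch of a job for reproducing the scale-down scenario.
// Parallelism, checkpoint interval, and the sleep duration are illustrative.
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class UnalignedRescaleProbe {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(100);                           // first run: high parallelism
        env.enableCheckpointing(10_000L);                  // checkpoint every 10s
        env.getCheckpointConfig().enableUnalignedCheckpoints();

        env.fromSequence(0, Long.MAX_VALUE)                // effectively unbounded source
                .rebalance()                               // network exchange, so channel state exists
                .map(v -> { Thread.sleep(5); return v; })  // slow stage to build up backpressure
                .returns(Types.LONG)
                .print();

        env.execute("unaligned-checkpoint-rescale-probe");
    }
}

Restoring the checkpoint with, say, parallelism 1 makes it easy to observe whether the recovered channel state still fits into the smaller network buffer pool or has to be spilled.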
