Thanks everyone for the feedback!

I'll start the vote tomorrow if there are no more comments. Thanks!

Best,
Rui

On Wed, Sep 17, 2025 at 8:09 AM Zakelly Lan <[email protected]> wrote:

> Hi Rui,
>
> It's a nice addition, and +1 for this optimization. I read through the
> design and have no questions.
>
> Thanks for driving this.
>
>
> Best,
> Zakelly
>
> On Tue, Sep 16, 2025 at 9:15 PM Gabor Somogyi <[email protected]>
> wrote:
>
> > I've played a bit with the two scenarios mentioned, and I agree with you.
> > Namely, I also don't expect unmanageable additional disk requirements with
> > this addition.
> > Later, if we see something, we still have the option to add some limits.
> >
> > +1 from my side.
> >
> > BR,
> > G
> >
> >
> > On Fri, Sep 12, 2025 at 10:48 AM Rui Fan <[email protected]> wrote:
> >
> > > Hey Gabor, thanks for your interest and the discussion!
> > >
> > > > We see that a couple of workloads already require heavy disk usage. Are
> > > > there any numbers on what additional spilling would mean when buffers are
> > > > exhausted?
> > > > Some sort of ratio would also be good.
> > >
> > > My primary assessment is that the volume of "channel state" data
> > > spilled to disk should generally not be excessive, because this state
> > > originates entirely from in-memory network buffers, and the total
> > > available disk capacity is typically far greater than the total size
> > > of these memory buffers.
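> > >
> > > To make the ratio question concrete, here is a rough back-of-envelope
> > > sketch (both sizes are hypothetical placeholders, not measurements):
> > > the worst-case spill volume is bounded by the total network buffer
> > > memory, which is usually a tiny fraction of the local disk.
> > >
> > >     long networkBufferMemory = 1L << 30;  // e.g. 1 GiB of network memory per TaskManager
> > >     long localDiskCapacity = 200L << 30;  // e.g. 200 GiB of local disk per TaskManager
> > >     // Worst case: every in-flight network buffer is spilled during recovery.
> > >     double worstCaseDiskRatio =
> > >         (double) networkBufferMemory / localDiskCapacity;  // 0.005, i.e. 0.5%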
> > >
> > > As I see it, there are two main scenarios that could trigger spilling:
> > >
> > > Scenario 1: Scaling Down Parallelism
> > >
> > > For example, suppose parallelism is reduced from 100 to 1. The old job
> > > (with 100 instances) might have a large amount of state held in its
> > > network buffers. The new, scaled-down job (with 1 instance) has
> > > significantly less memory allocated for network buffers, which could
> > > be insufficient to hold the state during recovery, thus causing a
> > > spill.
> > >
> > > However, I believe this scenario is unlikely in practice. A large
> > > amount of channel state (which is snapshotted by unaligned checkpoints)
> > > usually indicates high backpressure, and the correct operational
> > > response would be to scale up, not down. Scaling up would provide more
> > > network buffer memory, which would prevent, rather than cause, spilling.
> > >
> > > Scenario 2: All recovered buffers are restored on the input side
> > >
> > > This is a more plausible scenario. Even if the parallelism is
> > > unchanged, a task's input buffer pool might need to accommodate both
> > > its own recovered input state and the recovered output state from
> > > upstream tasks. The combined size of this data could exceed the input
> > > pool's capacity and trigger spilling, as sketched below.
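> > >
> > > A minimal sketch of the condition (all names are invented for
> > > illustration; they are not the actual Flink internals):
> > >
> > >     // Names are hypothetical; all sizes are in bytes.
> > >     static boolean mustSpill(
> > >             long recoveredInputState,     // this task's own recovered input channel state
> > >             long recoveredUpstreamOutput, // output state redistributed from upstream tasks
> > >             long inputPoolCapacity) {     // memory available in the input buffer pool
> > >         // Spilling is needed once the combined recovered bytes exceed the pool.
> > >         return recoveredInputState + recoveredUpstreamOutput > inputPoolCapacity;
> > >     }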
> > >
> > > > Is it planned to opt for slower memory-only recovery after a declared
> > > > maximum disk usage is exceeded? I can imagine situations where
> > > > memory and disk fill quickly, which will blow things up and stay in
> > > > an infinite loop (huge state + rescale).
> > >
> > > Regarding your question about a fallback plan for when disk usage
> > > exceeds its limit: currently, we do not have such a "slower"
> > > memory-only plan in place.
> > >
> > > The main reason is consistent with the point above: we believe the
> > > risk of filling the disk is manageable, as the disk capacity is
> > > generally much larger than the potential volume of data from the
> > > in-memory network buffers.
> > >
> > > However, I completely agree with your suggestion. Implementing such a
> > > safety valve would be a valuable future addition. We will monitor for
> > > related issues and, if they arise, prioritize this enhancement.
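> > >
> > > If we do add it, the safety valve could be as simple as a byte budget
> > > checked before each spill. A purely hypothetical sketch (neither the
> > > limit nor the method exists today):
> > >
> > >     // Hypothetical; maxSpillBytes would come from a future config option.
> > >     static boolean shouldFallBackToMemoryOnlyRecovery(
> > >             long alreadySpilledBytes, long nextSpillBytes, long maxSpillBytes) {
> > >         // Stop spilling and keep the remaining buffers in memory once
> > >         // the declared disk budget would be exceeded.
> > >         return alreadySpilledBytes + nextSpillBytes > maxSpillBytes;
> > >     }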
> > >
> > > WDYT?
> > >
> > > Best,
> > > Rui
> > >
> > > On Thu, Sep 11, 2025 at 11:07 PM Roman Khachatryan <[email protected]>
> > > wrote:
> > >
> > > > Hi Rui, thanks for driving this!
> > > >
> > > > This would be a very useful addition to the Unaligned Checkpoints.
> > > >
> > > > I have no comments on the proposal as we already discussed it
> > > > offline. Looking forward to it being implemented and released!
> > > >
> > > > Regards,
> > > > Roman
> > > >
> > > >
> > > > On Thu, Sep 11, 2025 at 3:52 PM Gabor Somogyi <[email protected]>
> > > > wrote:
> > > >
> > > > > Hi Rui,
> > > > >
> > > > > The proposal describes the problem and plan in a detailed way, +1 on
> > > > > addressing this. I have a couple of questions:
> > > > > - We see that a couple of workloads already require heavy disk usage. Are
> > > > > there any numbers on what additional spilling would mean when buffers are
> > > > > exhausted?
> > > > > Some sort of ratio would also be good.
> > > > > - Is it planned to opt for slower memory-only recovery after a declared
> > > > > maximum disk usage is exceeded? I can imagine situations where
> > > > > memory and disk fill quickly, which will blow things up and stay in
> > > > > an infinite loop (huge state + rescale).
> > > > >
> > > > > BR,
> > > > > G
> > > > >
> > > > >
> > > > > On Thu, Sep 11, 2025 at 12:34 PM Rui Fan <[email protected]>
> > > > > wrote:
> > > > >
> > > > > > Hey everyone,
> > > > > >
> > > > > > I would like to start a discussion about FLIP-547: Support
> > > > > > checkpoint during recovery [1].
> > > > > >
> > > > > > Currently, when a Flink job recovers from an unaligned checkpoint,
> > > > > > it cannot trigger a new checkpoint until the entire recovery
> > > > > > process is complete. For state-heavy or computationally intensive
> > > > > > jobs, this recovery phase can be very slow, sometimes lasting for
> > > > > > hours.
> > > > > >
> > > > > > This limitation introduces significant challenges. It can block
> > > > > > upstream and downstream systems, and any interruption (like
> > > > > > another failure or a rescaling event) during this long recovery
> > > > > > period causes the job to lose all progress and revert to the last
> > > > > > successful checkpoint. This severely impacts the reliability and
> > > > > > operational efficiency of long-running, large-scale jobs.
> > > > > >
> > > > > > This proposal aims to solve these problems by allowing checkpoints
> > > > > > to be taken *during* the recovery phase. This would allow a job to
> > > > > > periodically save its restored progress, making the recovery
> > > > > > process itself fault-tolerant. Adopting this feature will make
> > > > > > Flink more robust, improve reliability for demanding workloads,
> > > > > > and strengthen processing guarantees like exactly-once semantics.
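> > > > > >
> > > > > > For context, the feature targets jobs recovering from unaligned
> > > > > > checkpoints, which are enabled with the existing API (this snippet
> > > > > > shows only today's API, nothing new from the FLIP):
> > > > > >
> > > > > >     StreamExecutionEnvironment env =
> > > > > >         StreamExecutionEnvironment.getExecutionEnvironment();
> > > > > >     env.enableCheckpointing(60_000L);  // checkpoint every 60 seconds
> > > > > >     // Unaligned checkpoints snapshot in-flight channel state, the
> > > > > >     // kind of state restored (and possibly spilled) during recovery.
> > > > > >     env.getCheckpointConfig().enableUnalignedCheckpoints();
> > > > > >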
> > > > > > Looking forward to feedback!
> > > > > >
> > > > > > [1]
> > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-547%3A+Support+checkpoint+during+recovery
> > > > > >
> > > > > > Best,
> > > > > > Rui
> > > > > >
> > > > >
> > > >
> > >
> >
>
