Hey Gabor, thanks for your feedback and for the discussion!

> We see that couple of workloads require heavy disk usage already. Are
> there any numbers what additional spilling would mean when buffers
> exhausted?
> Some sort of ratio would be also good.

My primary assessment is that the volume of "channel state" data spilled
to disk should generally not be excessive. This state originates entirely
from the in-memory network buffers, and the total available disk capacity
is typically far greater than the total size of those memory buffers.
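
To give a rough sense of the ratio you asked about, here is a
back-of-the-envelope sketch in Python. Every number in it is an assumption
for illustration, not a measurement from a real deployment:

    # Rough illustrative sizing; every number below is an assumption, and it
    # assumes the recovered channel state is spread roughly evenly across
    # TaskManagers.
    tm_total_memory_mb = 4096          # assumed TaskManager process size
    network_fraction   = 0.1           # assumed share reserved for network buffers
    local_disk_mb      = 100 * 1024    # assumed local disk available for spilling

    # Channel state snapshotted by an unaligned checkpoint is bounded by the
    # in-flight data that fits into the network buffers, so the spill volume
    # per TaskManager is at most roughly the network buffer memory.
    network_buffers_mb = tm_total_memory_mb * network_fraction  # ~410 MB
    ratio = network_buffers_mb / local_disk_mb                  # ~0.4%
    print(f"worst-case spill is about {ratio:.1%} of the local disk")

With numbers in this ballpark the spilled channel state stays a small
fraction of the disk, which is why I expect the additional disk pressure
to be limited; the real ratio of course depends on the actual memory and
disk configuration.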

As I see it, there are two main scenarios that could trigger spilling:

Scenario 1: Scaling Down Parallelism

For example, suppose parallelism is reduced from 100 to 1. The old job
(with 100 instances) might have held a large amount of state in its
network buffers, while the new, scaled-down job (with 1 instance) has
significantly less memory allocated for network buffers, which could
be insufficient to hold that state during recovery and thus cause a spill.
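
As a small made-up illustration of how the sizes can diverge after such a
rescale (the per-subtask buffer size below is purely an assumption):

    # Illustrative only; the per-subtask buffer size is an assumption.
    old_parallelism     = 100
    buffers_per_task_mb = 32                                      # assumed per-subtask network buffers
    channel_state_mb    = old_parallelism * buffers_per_task_mb   # up to ~3.2 GB of snapshotted state
    new_pool_mb         = 1 * buffers_per_task_mb                 # a single subtask after rescaling
    spill_mb            = max(0, channel_state_mb - new_pool_mb)  # the remainder would go to disk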

However, I believe this scenario is unlikely in practice. A large amount
of channel state (the data snapshotted by unaligned checkpoints) usually
indicates high backpressure, and the correct operational response
would be to scale up, not down. Scaling up would provide more network
buffer memory, which would prevent, rather than cause, spilling.

Scenario 2: All recovered buffers are restored on the input side

This is a more plausible scenario. Even if the parallelism is unchanged,
a task's input buffer pool might need to accommodate both its own
recovered input state and the recovered output state from upstream
tasks. The combined size of this data could exceed the input pool's
capacity and trigger spilling.
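
A small made-up example of how this can happen even at unchanged
parallelism (all sizes below are assumptions):

    # Illustrative only; all sizes are assumptions.
    input_pool_mb          = 32    # assumed capacity of one task's input buffer pool
    recovered_input_mb     = 24    # assumed recovered input channel state of this task
    recovered_up_output_mb = 20    # assumed recovered output state from its upstream tasks
    overflow_mb = max(0, recovered_input_mb + recovered_up_output_mb - input_pool_mb)
    # overflow_mb = 12 here, so spilling would be triggered without any rescaling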

> Is it planned to opt for slower memory-only recovery after a declared
> maximum disk usage exceeded? I can imagine situations where
> memory and disk filled quickly which will blow things up and stays in an
> infinite loop (huge state + rescale).

Regarding your question about a fallback for when disk usage exceeds its
limit: currently, we do not plan such a "slower" memory-only recovery path.

The main reason is consistent with the point above: we believe the risk of
filling the disk is manageable, as the disk capacity is generally much
larger than the potential volume of data held in the in-memory network
buffers.

However, I completely agree with your suggestion: such a safety valve
would be a valuable addition. We will monitor for related issues, and if
they arise, we'll prioritize this enhancement.

WDYT?

Best,
Rui

On Thu, Sep 11, 2025 at 11:07 PM Roman Khachatryan <[email protected]> wrote:

> Hi Rui, thanks for driving this!
>
> This would be a very useful addition to the Unaligned Checkpoints.
>
> I have no comments on the proposal as we already discussed it offline,
> Looking forward to it being implemented and released!
>
> Regards,
> Roman
>
>
> On Thu, Sep 11, 2025 at 3:52 PM Gabor Somogyi <[email protected]>
> wrote:
>
> > Hi Rui,
> >
> > The proposal describes the problem and plan in a detailed way, +1 on
> > addressing this. I've couple of questions:
> > - We see that couple of workloads require heavy disk usage already. Are
> > there any numbers what additional spilling would mean when buffers
> > exhausted?
> > Some sort of ratio would be also good.
> > - Is it planned to opt for slower memory-only recovery after a declared
> > maximum disk usage exceeded? I can imagine situations where
> > memory and disk filled quickly which will blow things up and stays in an
> > infinite loop (huge state + rescale).
> >
> > BR,
> > G
> >
> >
> > On Thu, Sep 11, 2025 at 12:34 PM Rui Fan <[email protected]> wrote:
> >
> > > Hey everyone,
> > >
> > > I would like to start a discussion about FLIP-547: Support checkpoint
> > > during recovery [1].
> > >
> > > Currently, when a Flink job recovers from an unaligned checkpoint, it
> > > cannot trigger a new checkpoint until the entire recovery process is
> > > complete. For state-heavy or computationally intensive jobs, this
> > recovery
> > > phase can be very slow, sometimes lasting for hours.
> > >
> > > This limitation introduces significant challenges. It can block
> upstream
> > > and downstream systems, and any interruption (like another failure or a
> > > rescaling event) during this long recovery period causes the job to
> lose
> > > all progress and revert to the last successful checkpoint. This
> severely
> > > impacts the reliability and operational efficiency of long-running,
> > > large-scale jobs.
> > >
> > > This proposal aims to solve these problems by allowing checkpoints to
> be
> > > taken *during* the recovery phase. This would allow a job to
> periodically
> > > save its restored progress, making the recovery process itself
> > > fault-tolerant. Adopting this feature will make Flink more robust,
> > improve
> > > reliability for demanding workloads, and strengthen processing
> guarantees
> > > like exactly-once semantics.
> > > Looking forward to feedback!
> > >
> > > [1]
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-547%3A+Support+checkpoint+during+recovery
> > >
> > > Best,
> > > Rui
> > >
> >
>
