Hi Rui,

The proposal describes the problem and the plan in detail, +1 on
addressing this. I have a couple of questions:
- We already see that a couple of workloads require heavy disk usage. Are
there any numbers on what the additional spilling would mean once the
buffers are exhausted?
Some sort of ratio would also be good.
- Is it planned to fall back to the slower memory-only recovery once a
declared maximum disk usage is exceeded? I can imagine situations where
memory and disk fill up quickly, which would blow things up and leave the
job in an infinite restart loop (huge state + rescale). A rough sketch of
what I mean is below.
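
Just to illustrate the second question, a hypothetical sketch of how an
operator might bound this (the option names here are my own invention,
not something from the FLIP):

    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    Configuration conf = new Configuration();
    // Hypothetical cap on how much the recovery-time spilling may write to disk.
    conf.setString("execution.checkpointing.recovery.max-spill-size", "50 gb");
    // Hypothetical fallback once that cap is hit: stop spilling and continue
    // with the slower, memory-only recovery path instead of failing the job.
    conf.setString("execution.checkpointing.recovery.spill-exhausted-action", "memory-only");
    StreamExecutionEnvironment env =
        StreamExecutionEnvironment.getExecutionEnvironment(conf);

Something along these lines would let users bound the extra disk impact
explicitly and avoid the failure loop above.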

BR,
G


On Thu, Sep 11, 2025 at 12:34 PM Rui Fan <[email protected]> wrote:

> Hey everyone,
>
> I would like to start a discussion about FLIP-547: Support checkpoint
> during recovery [1].
>
> Currently, when a Flink job recovers from an unaligned checkpoint, it
> cannot trigger a new checkpoint until the entire recovery process is
> complete. For state-heavy or computationally intensive jobs, this recovery
> phase can be very slow, sometimes lasting for hours.
>
> This limitation introduces significant challenges. It can block upstream
> and downstream systems, and any interruption (like another failure or a
> rescaling event) during this long recovery period causes the job to lose
> all progress and revert to the last successful checkpoint. This severely
> impacts the reliability and operational efficiency of long-running,
> large-scale jobs.
>
> This proposal aims to solve these problems by allowing checkpoints to be
> taken *during* the recovery phase. This would allow a job to periodically
> save its restored progress, making the recovery process itself
> fault-tolerant. Adopting this feature will make Flink more robust, improve
> reliability for demanding workloads, and strengthen processing guarantees
> like exactly-once semantics.
>
> Looking forward to feedback!
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-547%3A+Support+checkpoint+during+recovery
>
> Best,
> Rui
>
