Re: [PR] FLIP-547: Support checkpoint during recovery [flink]

via GitHub Thu, 12 Mar 2026 15:35:01 -0700


1996fanrui commented on PR #27639:
URL: https://github.com/apache/flink/pull/27639#issuecomment-4050668220


   # A/B Test Summary: Checkpoint During Recovery
   
   The Task Initialization Time is:
   
   - ~70s to ~590s on the master branch
   - under 50 ms with this PR, regardless of whether the job scales up, scales 
down, or does not rescale.
   
   ## Setup
   
   - **Baseline (master)**: Standard Flink recovery — restores full state from 
the initial checkpoint on every rescale, accumulating state across rounds.
   - **Experiment (feature 
[branch](https://github.com/apache/flink/commits/2ce7b692189d7ea3fc40a6f2da5d9ffe0ebb7d25/)
 `38544/checkpointing-during-recovery`)**: Enables checkpointing during 
recovery, so each restart restores from the most recent checkpoint instead of 
the original one.
   - **Benchmark**: 
[UnalignedCheckpointBenchmark](https://github.com/1996fanrui/flink/blob/38544/20260312-03-fix-debug-checkpoint-is-slow-after-rescaling/flink-examples/flink-examples-streaming/src/main/java/org/apache/flink/streaming/examples/UnalignedCheckpointBenchmark.java)
 — [10 rounds of 
rescale](https://github.com/1996fanrui/flink/blob/38544/20260312-03-fix-debug-checkpoint-is-slow-after-rescaling/flink-examples/flink-examples-streaming/run_benchmark.sh)
 with 3 checkpoints per round.
   - **Scenarios**: No-rescale (5→5), Scale-up (4→6 to 13→15), Scale-down 
(15→13 to 6→4).
   - **Note**: The master branch completed only 2 of the 3 scenarios 
(no-rescale and scale-up). Its scale-down run was aborted after more than 2 
hours without completing, due to the state restoration overhead at high 
parallelism. The feature branch completed all 3 scenarios.
   
   ## Key Results
   
   ### Task Initialization Time (Map Vertex, avg per subtask)
   
   | Scenario | Round | Master (ms) | Feature (ms) | Speedup |
   |----------|-------|-------------|-------------|---------|
   | No-rescale | 2 | 103,489 | 32 | ~3,200x |
   | No-rescale | 5 | 536,459 | 27 | ~19,900x |
   | No-rescale | 10 | 528,006 | 24 | ~22,000x |
   | Scale-up | 2 | 69,935 | 26 | ~2,700x |
   | Scale-up | 5 | 294,950 | 23 | ~12,800x |
   | Scale-up | 10 | 45,937 | 17 | ~2,700x |
   
   On master, the Map vertex initialization time grows from ~100s to ~590s 
(no-rescale) and ~70s to ~350s (scale-up) as state accumulates across rounds, 
because each restart restores from the original checkpoint.
   
   On the feature branch, initialization stays consistently under 50ms across 
all rounds and all scenarios, because each restart restores from the latest 
checkpoint (minimal delta).
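
The Speedup column can be reproduced from the two timing columns. The snippet below is only a sketch for checking the arithmetic: the numbers are copied from the table above, while the names `INIT_TIMES_MS`, `speedup`, and `rounded_speedup` are made up for this example, and rounding to the nearest hundred is an assumption about how the table values were rounded.

```python
# Average Map-vertex initialization times in milliseconds, copied from the
# table above: (scenario, round) -> (master_ms, feature_ms).
INIT_TIMES_MS = {
    ("No-rescale", 2): (103_489, 32),
    ("No-rescale", 5): (536_459, 27),
    ("No-rescale", 10): (528_006, 24),
    ("Scale-up", 2): (69_935, 26),
    ("Scale-up", 5): (294_950, 23),
    ("Scale-up", 10): (45_937, 17),
}


def speedup(master_ms: float, feature_ms: float) -> float:
    """Return how many times faster the feature branch initializes."""
    return master_ms / feature_ms


def rounded_speedup(master_ms: float, feature_ms: float) -> int:
    """Round the speedup to the nearest hundred (assumed table rounding)."""
    return int(round(speedup(master_ms, feature_ms), -2))


# Print one line per table row, e.g. "No-rescale round  2: ~3,200x".
for (scenario, round_no), (master_ms, feature_ms) in INIT_TIMES_MS.items():
    print(f"{scenario:<10} round {round_no:>2}: "
          f"~{rounded_speedup(master_ms, feature_ms):,}x")
```

Under that rounding assumption, every row matches the Speedup value reported in the table.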
   
   ### Checkpoint Duration (avg across 3 checkpoints per round)
   
   | Metric | Master | Feature |
   |--------|--------|---------|
   | No-rescale avg | 26 ms | 25 ms |
   | Scale-up avg | 27 ms | 27 ms |
   | Scale-down avg | N/A (aborted after 2h) | 25 ms |
   
   Checkpoint duration is unaffected by the change — both branches complete 
checkpoints in ~25-27ms on average.
   
   ## Conclusion
   
   Enabling checkpointing during recovery eliminates the state-accumulation 
problem in task initialization. On master, initialization time grows as state 
accumulates across recovery rounds (reaching ~590s, nearly 10 minutes), and the 
scale-down scenario did not finish within 2 hours. On the feature branch, 
initialization stays at around 25ms regardless of how many rounds have elapsed, 
and all 3 scenarios complete quickly. Checkpoint duration remains unchanged.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

