1996fanrui commented on PR #27639: URL: https://github.com/apache/flink/pull/27639#issuecomment-4050668220
# A/B Test Summary: Checkpoint During Recovery

The Task Initialization Time is:

- ~70 s to ~590 s on the master branch
- consistently under 50 ms with this PR, regardless of whether the job scales up, scales down, or does not rescale.

## Setup

- **Baseline (master)**: Standard Flink recovery. Every restart restores the full state from the initial checkpoint, so state accumulates across rounds.
- **Experiment (feature [branch](https://github.com/apache/flink/commits/2ce7b692189d7ea3fc40a6f2da5d9ffe0ebb7d25/) `38544/checkpointing-during-recovery`)**: Enables checkpointing during recovery, so each restart restores from the most recent checkpoint instead of the original one.
- **Benchmark**: [UnalignedCheckpointBenchmark](https://github.com/1996fanrui/flink/blob/38544/20260312-03-fix-debug-checkpoint-is-slow-after-rescaling/flink-examples/flink-examples-streaming/src/main/java/org/apache/flink/streaming/examples/UnalignedCheckpointBenchmark.java), run as [10 rounds of rescale](https://github.com/1996fanrui/flink/blob/38544/20260312-03-fix-debug-checkpoint-is-slow-after-rescaling/flink-examples/flink-examples-streaming/run_benchmark.sh) with 3 checkpoints per round.
- **Scenarios**: No-rescale (5→5), Scale-up (4→6 to 13→15), Scale-down (15→16 to 6→7).
- **Note**: The master branch completed only 2 of the 3 scenarios (no-rescale and scale-up). Its scale-down run was aborted after more than 2 hours without completing, due to the state restoration overhead at high parallelism. The feature branch completed all 3 scenarios.
## Key Results

### Task Initialization Time (Map Vertex, avg per subtask)

| Scenario | Round | Master (ms) | Feature (ms) | Speedup |
|----------|-------|-------------|--------------|---------|
| No-rescale | 2 | 103,489 | 32 | ~3,200x |
| No-rescale | 5 | 536,459 | 27 | ~19,900x |
| No-rescale | 10 | 528,006 | 24 | ~22,000x |
| Scale-up | 2 | 69,935 | 26 | ~2,700x |
| Scale-up | 5 | 294,950 | 23 | ~12,800x |
| Scale-up | 10 | 45,937 | 17 | ~2,700x |

On master, the Map vertex initialization time ranges from ~100 s to ~590 s (no-rescale) and from ~70 s to ~350 s (scale-up) as state accumulates across rounds, because each restart restores from the original checkpoint. On the feature branch, initialization stays consistently under 50 ms across all rounds and all scenarios, because each restart restores from the latest checkpoint (minimal delta).

### Checkpoint Duration (avg across 3 checkpoints per round)

| Metric | Master | Feature |
|--------|--------|---------|
| No-rescale avg | 26 ms | 25 ms |
| Scale-up avg | 27 ms | 27 ms |
| Scale-down avg | N/A (aborted after 2h) | 25 ms |

Checkpoint duration is unaffected by the change: both branches complete checkpoints in ~25-27 ms on average.

## Conclusion

Enabling checkpointing during recovery eliminates the state-accumulation problem in task initialization. On master, initialization time grows as state accumulates across recovery rounds (up to ~10 minutes), and the scale-down scenario could not finish within 2 hours. With the feature branch, initialization stays roughly constant at ~25 ms regardless of how many rounds have elapsed, and all 3 scenarios complete quickly. Checkpoint performance remains unchanged.
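
For readers who want to check the Speedup column of the Task Initialization Time table, the arithmetic can be sketched as below. This is only an illustration: the numbers are copied from the table, while the names `ROWS`, `speedup`, and `round_to_hundreds` are hypothetical helpers and not part of the benchmark scripts.

```python
# Hypothetical helper script: recomputes the "Speedup" column from the
# Master/Feature initialization times (milliseconds) reported in the table.

# (scenario, round, master_ms, feature_ms), copied from the table.
ROWS = [
    ("No-rescale", 2, 103_489, 32),
    ("No-rescale", 5, 536_459, 27),
    ("No-rescale", 10, 528_006, 24),
    ("Scale-up", 2, 69_935, 26),
    ("Scale-up", 5, 294_950, 23),
    ("Scale-up", 10, 45_937, 17),
]


def speedup(master_ms: float, feature_ms: float) -> float:
    """Ratio of master initialization time to feature-branch time."""
    return master_ms / feature_ms


def round_to_hundreds(value: float) -> int:
    """Round to the nearest hundred, matching the table's '~N,N00x' style."""
    return int(round(value, -2))


# Map each (scenario, round) pair to its unrounded speedup.
SPEEDUPS = {
    (scenario, rnd): speedup(master_ms, feature_ms)
    for scenario, rnd, master_ms, feature_ms in ROWS
}

if __name__ == "__main__":
    for (scenario, rnd), ratio in SPEEDUPS.items():
        print(f"{scenario:<10} round {rnd:>2}: ~{round_to_hundreds(ratio):,}x")
```

Rounding to the nearest hundred is an assumption about how the table's approximate figures were produced; the unrounded ratios remain available in `SPEEDUPS`.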
