Re: [PR] [FLINK-39000] [runtime] Avoid redundant seeks during operator list state restore [flink]

via GitHub Mon, 23 Mar 2026 05:12:20 -0700


infocusmodereal commented on PR #27527:
URL: https://github.com/apache/flink/pull/27527#issuecomment-4110156316


   Here is the follow-up with a more production-shaped validation for this 
patch:
   
   This was run from a state derived from actual production checkpoint metadata 
and object store settings from two running Flink CDC jobs on our internal 
Kubernetes cluster, with checkpoints stored in Ceph/S3A.
   
   I inspected the latest checkpoints of two production jobs. In those 
checkpoints, the relevant `managed operator state` handled by 
`OperatorStateRestoreOperation` had:
   
   - job A: `395` total operator-state partitions (offsets)
   - job B: `510` total operator-state partitions (offsets)
   
   In both cases, the non-empty operator-state handles were mostly concentrated 
in a single merged task-owned state object. So the patch-relevant restore 
pattern is a few hundred sequential offsets in merged operator-state handles.
   
   To benchmark it, I built a synthetic job whose operator list-state shape 
was derived from the larger real checkpoint (`510` offsets total). I used a 
Ceph/S3A-backed checkpoint store and prod-like restore settings:
   
   - RocksDB state backend
   - incremental checkpoints
   - file merging enabled
   - same S3A/Ceph configuration family as the production jobs
   
   I ran `patched` and `baseline` jobs sequentially, and used separate storage 
prefixes for each run to avoid object-store contention. I added temporary 
logging in `OperatorStateRestoreOperation` to count performed vs skipped seeks 
and to measure the restore phase directly.
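
As a rough illustration of what the performed/skipped counters distinguish, here is a minimal sketch. It assumes the patch skips a seek whenever the stream is already positioned at the next partition's offset, which is how the counts in this comment read; it is not the code in `OperatorStateRestoreOperation`. The class `SeekCounter`, the method `countSeeks`, and the offset/length arrays are hypothetical names used only for this example.

```java
/** Hypothetical sketch, not Flink code: counts performed vs. skipped seeks. */
final class SeekCounter {

    private SeekCounter() {}

    /**
     * Walks the partitions in order and counts a seek as "performed" only when
     * the stream is not already at the partition's offset.
     *
     * @param offsets       start offset of each partition (illustrative input)
     * @param lengths       bytes read from each partition (illustrative input)
     * @param startPosition stream position before the first partition
     * @return {performedSeeks, skippedSeeks}
     */
    static int[] countSeeks(long[] offsets, long[] lengths, long startPosition) {
        if (offsets.length != lengths.length) {
            throw new IllegalArgumentException("offsets and lengths must match");
        }
        long position = startPosition;
        int performed = 0;
        int skipped = 0;
        for (int i = 0; i < offsets.length; i++) {
            if (position == offsets[i]) {
                skipped++;              // previous read ended exactly here
            } else {
                performed++;            // a real seek would be issued here
                position = offsets[i];
            }
            position += lengths[i];     // reading the element advances the stream
        }
        return new int[] {performed, skipped};
    }
}
```

With back-to-back partitions starting at the stream's initial position, every seek is skipped; any gap between the end of one partition and the start of the next forces a performed seek.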
   
   Results:
   
   | Load | Offsets | Restore span patched | Restore span baseline | Restore improvement | Deploy->Running patched | Deploy->Running baseline | End-to-end improvement |
   |---|---:|---:|---:|---:|---:|---:|---:|
   | 20% | 101 | 634 ms | 1055 ms | 39.9% | 3084 ms | 3344 ms | 7.8% |
   | 40% | 203 | 630 ms | 1140 ms | 44.7% | 3267 ms | 3351 ms | 2.5% |
   | 60% | 307 | 1306 ms | 1892 ms | 31.0% | 3562 ms | 4081 ms | 12.7% |
   | 80% | 409 | 1432 ms | 2089 ms | 31.5% | 4485 ms | 5188 ms | 13.6% |
   | 100% | 510 | 1624 ms | 2453 ms | 33.8% | 5095 ms | 5889 ms | 13.5% |
   
   - The patch behaved exactly as intended: baseline performed one seek per 
offset, while the patched build reduced that to `0` performed seeks in these 
runs.
   - The direct restore-phase improvement was consistent across the full 
matrix, ranging from 31.0% to 44.7%.
   - The end-to-end task startup improvement was smaller, but still 
significant at the higher loads (roughly 13% at 60% load and above).
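
For reference, the two improvement columns are the relative reduction of the patched time against the baseline time. The sketch below recomputes them from the millisecond values in the table; the class and method names are illustrative only, and the formula is inferred from the table values rather than taken from the benchmark harness.

```java
import java.util.Locale;

/** Illustrative helper: recomputes the improvement columns from the table. */
final class RestoreImprovement {

    private RestoreImprovement() {}

    /** Relative reduction of patched vs. baseline, in percent. */
    static double improvementPercent(long patchedMs, long baselineMs) {
        return 100.0 * (baselineMs - patchedMs) / baselineMs;
    }

    public static void main(String[] args) {
        // Columns: load %, restore patched, restore baseline,
        // deploy->running patched, deploy->running baseline (all times in ms).
        long[][] rows = {
            {20, 634, 1055, 3084, 3344},
            {40, 630, 1140, 3267, 3351},
            {60, 1306, 1892, 3562, 4081},
            {80, 1432, 2089, 4485, 5188},
            {100, 1624, 2453, 5095, 5889},
        };
        for (long[] r : rows) {
            System.out.printf(
                    Locale.ROOT,
                    "load %d%%: restore %.1f%%, end-to-end %.1f%%%n",
                    r[0],
                    improvementPercent(r[1], r[2]),
                    improvementPercent(r[3], r[4]));
        }
    }
}
```

Running `main` should reproduce the percentage columns of the table to one decimal place.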


