infocusmodereal commented on PR #27527: URL: https://github.com/apache/flink/pull/27527#issuecomment-4110156316
Here is the follow-up with a more production-shaped validation for this patch.

This was run from a state derived from actual production checkpoint metadata and object store settings from two running Flink CDC jobs on our internal Kubernetes cluster, with checkpoints stored in Ceph/S3A.

I inspected the latest checkpoints of two production jobs. In those checkpoints, the relevant `managed operator state` handled by `OperatorStateRestoreOperation` had:

- job A: `395` total operator-state partitions (offsets)
- job B: `510` total operator-state partitions (offsets)

In both cases, the non-empty operator-state handles were mostly concentrated in a single merged task-owned state object. So the patch-relevant restore pattern is a few hundred sequential offsets in merged operator-state handles.

To benchmark it, I built a synthetic job whose operator list-state shape was derived from the larger real checkpoint (`510` offsets total). I used a Ceph/S3A-backed checkpoint store and prod-like restore settings:

- RocksDB state backend
- incremental checkpoints
- file merging enabled
- same S3A/Ceph configuration family as the production jobs

I ran the `patched` and `baseline` jobs sequentially, and used separate storage prefixes for each run to avoid object-store contention. I added temporary logging in `OperatorStateRestoreOperation` to count performed vs. skipped seeks and to measure the restore phase directly.
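For readers who have not looked at the diff, the performed-vs-skipped seek accounting described above can be illustrated with a small standalone sketch. This is not the code from the PR: `SeekableInput`, `InMemoryInput`, and `SeekSkipSketch` are hypothetical names, the in-memory stream stands in for an S3A-backed stream, and the fixed partition size is an assumption made only to keep the example short. It assumes the skip condition is simply "the stream is already positioned at the requested offset".

```java
// Sketch only: these types are hypothetical and are not Flink classes.
interface SeekableInput {
    long getPos();
    void seek(long pos);
    int read(byte[] buf, int off, int len);
}

// In-memory stand-in for a remote stream; counts every seek that is actually performed.
final class InMemoryInput implements SeekableInput {
    private final byte[] data;
    private int pos;
    int performedSeeks;

    InMemoryInput(byte[] data) {
        this.data = data;
    }

    public long getPos() {
        return pos;
    }

    public void seek(long newPos) {
        performedSeeks++; // on an object store a seek can be far more expensive than here
        pos = (int) newPos;
    }

    public int read(byte[] buf, int off, int len) {
        int n = Math.min(len, data.length - pos);
        if (n <= 0) {
            return -1;
        }
        System.arraycopy(data, pos, buf, off, n);
        pos += n;
        return n;
    }
}

final class SeekSkipSketch {
    // Reads one fixed-size partition per offset and returns how many seeks were skipped.
    static int restore(SeekableInput in, long[] offsets, int partitionSize,
                       boolean skipRedundantSeeks) {
        int skipped = 0;
        byte[] buf = new byte[partitionSize];
        for (long offset : offsets) {
            if (skipRedundantSeeks && in.getPos() == offset) {
                skipped++; // already positioned at the offset, so no seek is needed
            } else {
                in.seek(offset);
            }
            in.read(buf, 0, partitionSize);
        }
        return skipped;
    }

    public static void main(String[] args) {
        int partitions = 510; // matches the larger checkpoint's offset count
        int size = 8;         // assumed partition size, for illustration only
        long[] offsets = new long[partitions];
        for (int i = 0; i < partitions; i++) {
            offsets[i] = (long) i * size; // back-to-back (sequential) offsets
        }
        byte[] data = new byte[partitions * size];

        InMemoryInput baseline = new InMemoryInput(data);
        restore(baseline, offsets, size, false);

        InMemoryInput patched = new InMemoryInput(data);
        int skipped = restore(patched, offsets, size, true);

        System.out.println("baseline performed seeks: " + baseline.performedSeeks);
        System.out.println("patched performed seeks: " + patched.performedSeeks
                + ", skipped: " + skipped);
    }
}
```

With strictly sequential offsets, the non-skipping path in this toy setup performs one seek per offset and the skipping path performs none, which is the pattern the temporary logging was meant to confirm on the real restore path.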
Results:

| Load | Offsets | Restore span patched | Restore span baseline | Restore improvement | Deploy->Running patched | Deploy->Running baseline | End-to-end improvement |
|---|---:|---:|---:|---:|---:|---:|---:|
| 20% | 101 | 634 ms | 1055 ms | 39.9% | 3084 ms | 3344 ms | 7.8% |
| 40% | 203 | 630 ms | 1140 ms | 44.7% | 3267 ms | 3351 ms | 2.5% |
| 60% | 307 | 1306 ms | 1892 ms | 31.0% | 3562 ms | 4081 ms | 12.7% |
| 80% | 409 | 1432 ms | 2089 ms | 31.5% | 4485 ms | 5188 ms | 13.6% |
| 100% | 510 | 1624 ms | 2453 ms | 33.8% | 5095 ms | 5889 ms | 13.5% |

- The patch behaved exactly as intended: baseline performed one seek per offset, while the patched build reduced that to `0` performed seeks in these runs.
- The direct restore-phase improvement was consistent across the full matrix (31.0% to 44.7%).
- The end-to-end task startup improvement was smaller (2.5% to 13.6%), but still significant at the higher loads (roughly 13% at 60% load and above).

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
