Gyula Fora created FLINK-12373:
----------------------------------
Summary: Improve checkpointing metrics
Key: FLINK-12373
URL: https://issues.apache.org/jira/browse/FLINK-12373
Project: Flink
Issue Type: New Feature
Components: Runtime / Checkpointing
Reporter: Gyula Fora
The CheckpointMetrics class currently exposes 4 core metrics for each
operator: bytesBuffered, alignment time, sync duration and async duration.
I think it would be a great improvement to break up the tracking of the sync
duration into its individual components, since that breakdown is critical
for improving the SLA of large jobs.
I suggest we break up the sync duration into 4 subcomponents:
1. prepareSnapshotPreBarrier
2. Snapshot timers
3. Snapshot operator states
4. Sync keyed state checkpoint
Maybe the operator state part could be further broken up into keyed and
non-keyed parts, I don't know.
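
To make the proposed breakdown concrete, here is a rough sketch of how the four
phases could be timed separately. It is only an illustration: SyncPhaseTimer,
its Phase enum and all method names are hypothetical and not part of Flink or
of the existing CheckpointMetrics class.

```java
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical helper (not a Flink class): splits the sync duration into named phases. */
final class SyncPhaseTimer {

    /** The four subcomponents proposed in this issue. */
    enum Phase {
        PREPARE_SNAPSHOT_PRE_BARRIER,
        SNAPSHOT_TIMERS,
        SNAPSHOT_OPERATOR_STATE,
        SYNC_KEYED_STATE
    }

    private final Map<Phase, Long> nanosByPhase = new EnumMap<>(Phase.class);

    /** Runs the given step and adds its elapsed time to the given phase. */
    void time(Phase phase, Runnable step) {
        long start = System.nanoTime();
        try {
            step.run();
        } finally {
            record(phase, System.nanoTime() - start);
        }
    }

    /** Adds an already measured duration (in nanoseconds) to the given phase. */
    void record(Phase phase, long nanos) {
        nanosByPhase.merge(phase, nanos, Long::sum);
    }

    /** Accumulated duration of one phase, or 0 if the phase never ran. */
    long nanos(Phase phase) {
        return nanosByPhase.getOrDefault(phase, 0L);
    }

    /** Sum over all phases; this corresponds to today's single sync duration. */
    long totalNanos() {
        long total = 0L;
        for (long n : nanosByPhase.values()) {
            total += n;
        }
        return total;
    }
}
```

Each synchronous step of the snapshot would be wrapped in a call such as
timer.time(Phase.SNAPSHOT_TIMERS, step), so the existing sync duration stays
available as the sum while the per-phase values can be reported individually.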
I think knowing these metrics is crucial for users to minimise the latency
caused by checkpointing.
Whether we want to show all this info on the web UI is another discussion :)