Gyula Fora created FLINK-12373:
----------------------------------

             Summary: Improve checkpointing metrics
                 Key: FLINK-12373
                 URL: https://issues.apache.org/jira/browse/FLINK-12373
             Project: Flink
          Issue Type: New Feature
          Components: Runtime / Checkpointing
            Reporter: Gyula Fora

The checkpoint metrics encapsulated in the CheckpointMetrics class currently expose 4 core metrics for each operator: bytesBuffered, alignment time, sync duration and async duration.

I think it would be a great improvement to break up the tracking of the sync duration into its different components, as it contains information that is critical for improving the SLA of large jobs.

I suggest we break up the sync duration into 4 subcomponents:

 1. prepareSnapshotPreBarrier
 2. Snapshot timers
 3. Snapshot operator states
 4. Sync keyed state checkpoint

Maybe the operator state part could be further broken up into keyed and non-keyed parts, I don't know.

I think knowing these metrics is crucial for users who want to minimise the latency caused by checkpointing. Whether we want to show all this info on the web UI is another discussion :)

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
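
To make the proposed breakdown concrete, here is a rough sketch of how the four sync subcomponents could be timed separately and still summed back into the existing sync duration. Everything in it is an assumption for illustration only: the SyncDurationBreakdown class, the Phase enum and the injected clock are hypothetical names, not part of CheckpointMetrics or any existing Flink API.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.function.LongSupplier;

// Hypothetical helper, not an existing Flink class: records how long each
// synchronous checkpoint phase takes so the phases can be reported separately.
final class SyncDurationBreakdown {

    // The four subcomponents suggested in this issue (names are illustrative).
    enum Phase {
        PREPARE_SNAPSHOT_PRE_BARRIER,
        SNAPSHOT_TIMERS,
        SNAPSHOT_OPERATOR_STATE,
        SYNC_KEYED_STATE
    }

    private final Map<Phase, Long> nanosByPhase = new EnumMap<>(Phase.class);
    private final LongSupplier clock;

    // The clock is injected so that the timing logic can be tested deterministically.
    SyncDurationBreakdown(LongSupplier clock) {
        this.clock = clock;
    }

    // Runs the given work and adds its elapsed time to the given phase,
    // even if the work throws.
    void time(Phase phase, Runnable work) {
        long start = clock.getAsLong();
        try {
            work.run();
        } finally {
            nanosByPhase.merge(phase, clock.getAsLong() - start, Long::sum);
        }
    }

    // Elapsed nanoseconds recorded for one phase (0 if it never ran).
    long nanosFor(Phase phase) {
        return nanosByPhase.getOrDefault(phase, 0L);
    }

    // Sum over all phases; this corresponds to today's single sync duration.
    long totalSyncNanos() {
        long total = 0L;
        for (long nanos : nanosByPhase.values()) {
            total += nanos;
        }
        return total;
    }

    public static void main(String[] args) {
        SyncDurationBreakdown breakdown = new SyncDurationBreakdown(System::nanoTime);
        // Placeholder work; a real operator would call its snapshot hooks here.
        breakdown.time(Phase.PREPARE_SNAPSHOT_PRE_BARRIER, () -> { });
        breakdown.time(Phase.SNAPSHOT_TIMERS, () -> { });
        breakdown.time(Phase.SNAPSHOT_OPERATOR_STATE, () -> { });
        breakdown.time(Phase.SYNC_KEYED_STATE, () -> { });
        for (Phase phase : Phase.values()) {
            System.out.println(phase + ": " + breakdown.nanosFor(phase) + " ns");
        }
        System.out.println("total sync: " + breakdown.totalSyncNanos() + " ns");
    }
}
```

Keeping the total as the sum of the parts means the existing sync duration metric would not change meaning; the per-phase values would only add detail on top of it.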