Gyula Fora created FLINK-12373:
----------------------------------

             Summary: Improve checkpointing metrics
                 Key: FLINK-12373
                 URL: https://issues.apache.org/jira/browse/FLINK-12373
             Project: Flink
          Issue Type: New Feature
          Components: Runtime / Checkpointing
            Reporter: Gyula Fora


The checkpoint metrics encapsulated in the CheckpointMetrics class currently 
expose 4 core metrics for each operator: bytesBuffered, alignment time, sync 
duration and async duration.

I think it would be a great improvement to break up the tracking of the sync 
duration into its individual components, as it contains information that is 
critical for improving the SLA of large jobs.

I suggest we break up the sync duration into 4 subcomponents:

 1. prepareSnapshotPreBarrier
 2. Snapshot timers
 3. Snapshot operator states
 4. Sync keyed state checkpoint
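
To illustrate the idea, the breakdown could be tracked roughly as follows. This 
is only a minimal sketch: SyncPhaseTimings, Phase and all method names are 
hypothetical and are not part of the existing CheckpointMetrics class, and where 
exactly the timing calls would go in the operator snapshot code is left open.

```java
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical holder for per-phase sync timings; not an existing Flink class. */
public class SyncPhaseTimings {

    /** The four proposed subcomponents of the sync duration. */
    public enum Phase {
        PREPARE_SNAPSHOT_PRE_BARRIER,
        SNAPSHOT_TIMERS,
        SNAPSHOT_OPERATOR_STATE,
        SYNC_KEYED_STATE
    }

    private final Map<Phase, Long> nanosByPhase = new EnumMap<>(Phase.class);

    /** Runs the step and adds its elapsed time to the given phase, even if it throws. */
    public void time(Phase phase, Runnable step) {
        long start = System.nanoTime();
        try {
            step.run();
        } finally {
            record(phase, System.nanoTime() - start);
        }
    }

    /** Adds an already measured duration (in nanoseconds) to the given phase. */
    public void record(Phase phase, long nanos) {
        nanosByPhase.merge(phase, nanos, Long::sum);
    }

    /** Returns the accumulated nanoseconds for a phase, or 0 if nothing was recorded. */
    public long getNanos(Phase phase) {
        return nanosByPhase.getOrDefault(phase, 0L);
    }

    /** Sum over all phases; corresponds to today's single sync duration value. */
    public long getTotalSyncNanos() {
        long total = 0L;
        for (long nanos : nanosByPhase.values()) {
            total += nanos;
        }
        return total;
    }
}
```

The total is kept as the sum of the parts, so the existing sync duration could 
still be reported unchanged alongside the new per-phase values.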

Maybe the operator state part could be further broken up into keyed and 
non-keyed parts; I don't know.

I think knowing these metrics is crucial for users to minimise the latency 
caused by checkpointing.

Whether we want to show all this info on the web UI is another discussion :)

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
