[jira] [Created] (FLINK-20912) Increase Log and Metric: Time consumed by Checkpoint Restore

fanrui (Jira) Sun, 10 Jan 2021 01:06:07 -0800

fanrui created FLINK-20912:
------------------------------

             Summary: Increase Log and Metric: Time consumed by Checkpoint 
Restore
                 Key: FLINK-20912
                 URL: https://issues.apache.org/jira/browse/FLINK-20912
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / Checkpointing, Runtime / State Backends
    Affects Versions: 1.13.0, 1.12.1
            Reporter: fanrui



In production, jobs with strict SLAs need to restart quickly after a failover. 
Restoring state from a checkpoint is an important part of task startup, so logs 
and metrics for the time it takes should be added to make slow task starts 
easier to troubleshoot.

For example, at Flink Forward Asia (FFA) 2020 ByteDance shared that they 
parallelized the restore of OperatorState. Without such metrics there are two 
problems:
1. Slow starts are hard to diagnose. If a task starts slowly, there is no way 
to tell whether slow checkpoint restore is the root cause.
2. Optimizations cannot be quantified. After an optimization, it is unclear how 
much faster restore has become.

I believe many companies have already added similar metrics to their internal 
Flink versions.
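
As a rough illustration of the kind of instrumentation proposed here, the sketch below times a restore step, logs the duration, and keeps the last value so that a metric could report it. It is only a sketch under assumptions: RestoreTimer, timeRestore and getLastRestoreMillis are hypothetical names, not existing Flink classes or methods, and java.util.logging stands in for whatever logger the runtime actually uses.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;
import java.util.function.Supplier;
import java.util.logging.Logger;

/** Hypothetical helper (not part of Flink): times a checkpoint-restore step. */
public class RestoreTimer {
    private static final Logger LOG = Logger.getLogger(RestoreTimer.class.getName());

    // Nanosecond clock; injectable so the timing logic can be tested deterministically.
    private final LongSupplier nanoClock;

    // Duration of the most recent restore in milliseconds, or -1 if none has run yet.
    private volatile long lastRestoreMillis = -1L;

    public RestoreTimer() {
        this(System::nanoTime);
    }

    public RestoreTimer(LongSupplier nanoClock) {
        this.nanoClock = nanoClock;
    }

    /** Runs the restore action, then records and logs how long it took. */
    public <T> T timeRestore(String taskName, Supplier<T> restoreAction) {
        long start = nanoClock.getAsLong();
        try {
            return restoreAction.get();
        } finally {
            // Recorded even if the restore throws, so failed attempts are visible too.
            lastRestoreMillis = TimeUnit.NANOSECONDS.toMillis(nanoClock.getAsLong() - start);
            LOG.info(String.format(
                    "Task %s restored state from checkpoint in %d ms",
                    taskName, lastRestoreMillis));
        }
    }

    /** Value a metric (for example a gauge) could report. */
    public long getLastRestoreMillis() {
        return lastRestoreMillis;
    }
}
```

With something like this in place, a slow task start could be checked against the logged restore time, and the effect of an optimization such as parallel OperatorState restore could be compared before and after. Where exactly such a hook would sit in the Flink runtime, and under which metric name, is left open here.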



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
