fanrui created FLINK-20912:
------------------------------
Summary: Increase Log and Metric: Time consumed by Checkpoint Restore
Key: FLINK-20912
URL: https://issues.apache.org/jira/browse/FLINK-20912
Project: Flink
Issue Type: Improvement
Components: Runtime / Checkpointing, Runtime / State Backends
Affects Versions: 1.13.0, 1.12.1
Reporter: fanrui
In production, jobs with strict SLAs must restart quickly after a failover.
Checkpoint restore is an important part of task startup, so Flink should log
the time it takes and expose it as a metric. This would make slow task starts
easier to troubleshoot.
For example, ByteDance shared at FFA 2020 that they parallelized OperatorState
restore. Without these metrics, there are two problems:
1. The problem is hard to locate. If a task starts slowly, there is no way to
tell whether slow Checkpoint restore is the root cause.
2. The benefit of an optimization cannot be quantified. Without a measurement,
it is unclear how much faster restore has become.
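A minimal sketch of the kind of instrumentation meant here, using only the
JDK and no Flink classes. The names RestoreTimingSketch, timedRestore and
lastRestoreMillis are hypothetical and do not exist in Flink. In a real
implementation the recorded value would presumably be registered as a gauge
on the task or operator metric group; that wiring is not shown.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;
import java.util.logging.Logger;

public class RestoreTimingSketch {
    private static final Logger LOG =
            Logger.getLogger(RestoreTimingSketch.class.getName());

    // Hypothetical holder for the last restore duration in milliseconds.
    // A metric gauge could read from it; -1 means "no restore measured yet".
    public static final AtomicLong lastRestoreMillis = new AtomicLong(-1L);

    // Runs the given restore action, then records and logs how long it took.
    // The duration is recorded even if the restore action throws.
    public static <T> T timedRestore(String operatorName, Supplier<T> restoreAction) {
        long start = System.nanoTime();
        try {
            return restoreAction.get();
        } finally {
            long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            lastRestoreMillis.set(elapsedMs);
            LOG.info(() -> String.format(
                    "Restored state for %s in %d ms", operatorName, elapsedMs));
        }
    }

    public static void main(String[] args) {
        // Stand-in for a real state restore: sleep briefly, return a dummy value.
        String state = timedRestore("example-operator", () -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return "restored-state";
        });
        System.out.println(state + " / " + lastRestoreMillis.get() + " ms");
    }
}
```

With a log line and a recorded duration per restore, a slow task start can be
attributed to restore (or ruled out), and the effect of an optimization can be
compared before and after.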
I believe many companies have already added such metrics to their internal
Flink versions.