[ https://issues.apache.org/jira/browse/FLINK-14164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16935691#comment-16935691 ]
Zhu Zhu commented on FLINK-14164:
---------------------------------

[~wind_ljy] Thanks for offering to do this. I think this requires some knowledge of the scheduling and failover implementations. You can take it if you are prepared.

numberOfFailures/numberOfRestarts are the names I've come up with, but I have not yet decided which one is better. The metric is meant to show the number of failovers that have happened, which indicates that issues are occurring. However, the count of failed tasks can be useful to show the impact of failovers, so maybe we can also expose it as numberOfTasksRestarted.

[~trohrmann] [~gjy] what's your opinion?

> Add a metric to show failover count regarding fine grained recovery
> -------------------------------------------------------------------
>
>                 Key: FLINK-14164
>                 URL: https://issues.apache.org/jira/browse/FLINK-14164
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination, Runtime / Metrics
>    Affects Versions: 1.9.0, 1.10.0
>            Reporter: Zhu Zhu
>            Priority: Major
>             Fix For: 1.10.0
>
>
> Previously, Flink used the restart-all strategy to recover jobs from failures,
> and the metric "fullRestart" was used to show the count of failovers.
> However, with fine grained recovery introduced in 1.9.0, the "fullRestart"
> metric only reveals how many times the entire graph has been restarted, not
> including the number of fine grained failure recoveries.
> As many users want to build their job alerting based on failovers, I'd
> propose to add a new metric {{numberOfFailures}}/{{numberOfRestarts}}
> which also respects fine grained recoveries.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
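
To make the distinction between the proposed counters concrete, here is a minimal sketch of how they might relate to each other. It is only an illustration: the class FailoverCounters, the method onFailover and its parameters are hypothetical names and not part of Flink, and the sketch assumes that every failover (full or fine grained) is reported exactly once together with the number of tasks it restarts.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical holder for the counters discussed in this issue; not a Flink class.
final class FailoverCounters {
    // Counts only restarts of the entire graph (what "fullRestart" reports).
    private final AtomicLong fullRestarts = new AtomicLong();
    // Counts every failover, including fine grained recoveries.
    private final AtomicLong numberOfRestarts = new AtomicLong();
    // Sums the tasks restarted across all failovers, i.e. their impact.
    private final AtomicLong numberOfTasksRestarted = new AtomicLong();

    // Assumed to be called once per failover by the (hypothetical) caller.
    void onFailover(int tasksToRestart, boolean entireGraph) {
        numberOfRestarts.incrementAndGet();
        numberOfTasksRestarted.addAndGet(tasksToRestart);
        if (entireGraph) {
            fullRestarts.incrementAndGet();
        }
    }

    long getFullRestarts() {
        return fullRestarts.get();
    }

    long getNumberOfRestarts() {
        return numberOfRestarts.get();
    }

    long getNumberOfTasksRestarted() {
        return numberOfTasksRestarted.get();
    }
}
```

Under these assumptions, a job that goes through one fine grained recovery and one full restart would report 2 restarts but only 1 full restart, which is the gap an alert based on "fullRestart" alone would miss.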