[ https://issues.apache.org/jira/browse/FLINK-14206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16938334#comment-16938334 ]
Till Rohrmann commented on FLINK-14206:
---------------------------------------

I agree with Zhu Zhu. With fine grained recovery, a full restart is just a special case of the more general partial failover. If enough users are interested in full restarts, then we might keep this metric and let partial failovers be reported by a different metric. But not reporting any failover when using fine grained recovery might be a bit deceiving.

> Let fullRestart metric count fine grained restarts as well
> ----------------------------------------------------------
>
>                 Key: FLINK-14206
>                 URL: https://issues.apache.org/jira/browse/FLINK-14206
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 1.9.0
>            Reporter: Zhu Zhu
>            Priority: Major
>             Fix For: 1.9.1
>
>
> With fine grained recovery introduced in 1.9.0, the {{fullRestart}} metric
> only counts how many times the entire graph has been restarted, not including
> the number of fine grained failure restarts.
> As many users leverage this metric for failure detection, monitoring, and
> alerting, I'd propose to make it also count fine grained failure restarts.
> The concrete proposal is:
> 1. Add a counter {{numberOfRestartCounter}} in ExecutionGraph to count all
> restarts. The counter is not to be registered to metric groups.
> 2. Let {{fullRestart}} query the value of the counter, instead of
> {{ExecutionGraph#globalModVersion}}
> 3. Increment {{numberOfRestartCounter}} in {{ExecutionGraph#failGlobal}}
> 4. Increment {{numberOfRestartCounter}} in
> {{ExecutionGraph#notifyExecutionChange}} when notifying the failover
> strategy, or maybe in {{AdaptedRestartPipelinedRegionStrategyNG}} to count
> only failovers that actually happened



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
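
For illustration, the four-step proposal in the quoted issue description could be sketched roughly as below. This is a minimal, self-contained sketch and not Flink's actual code: SketchExecutionGraph, SketchRestartGauge, onRegionFailover and fullRestartGauge are hypothetical names chosen for this example, and the real ExecutionGraph and metric registration are considerably more involved.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/** Hypothetical stand-in for ExecutionGraph; not Flink's actual class. */
class SketchExecutionGraph {
    // Step 1: internal counter of all restarts. It is deliberately not
    // registered with any metric group.
    private final AtomicLong numberOfRestarts = new AtomicLong();

    // Stand-in for globalModVersion, which only changes on global restarts.
    private long globalModVersion;

    // Step 3: a global failure restarts the entire graph.
    void failGlobal() {
        globalModVersion++;
        numberOfRestarts.incrementAndGet();
    }

    // Step 4: a fine grained (region) failover also counts as a restart.
    // Assumption: this is called once per failover that actually happens.
    void onRegionFailover() {
        numberOfRestarts.incrementAndGet();
    }

    long getNumberOfRestarts() {
        return numberOfRestarts.get();
    }

    long getGlobalModVersion() {
        return globalModVersion;
    }
}

/** Hypothetical helper showing what the metric would read. */
class SketchRestartGauge {
    // Step 2: the fullRestart value is read from the counter instead of
    // from globalModVersion.
    static LongSupplier fullRestartGauge(SketchExecutionGraph graph) {
        return graph::getNumberOfRestarts;
    }
}
```

With one global failure and two region failovers, the sketched gauge would report three restarts, whereas a gauge backed by globalModVersion would report only one.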