[jira] [Updated] (FLINK-14164) Add a metric to show failover count regarding fine grained recovery
[ https://issues.apache.org/jira/browse/FLINK-14164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gary Yao updated FLINK-14164:
-----------------------------
    Description:
Previously, Flink used the restart-all strategy to recover jobs from failures, and the metric "fullRestart" showed the count of failovers.
However, with fine-grained recovery introduced in 1.9.0, the "fullRestart" metric only reveals how many times the entire graph has been restarted; it does not include fine-grained failure recoveries.
As many users want to build their job alerting based on failovers, I propose adding a new metric {{numberOfRestarts}} that also accounts for fine-grained recoveries. The metric should be a Gauge.

  was:
Previously, Flink used the restart-all strategy to recover jobs from failures, and the metric "fullRestart" showed the count of failovers.
However, with fine-grained recovery introduced in 1.9.0, the "fullRestart" metric only reveals how many times the entire graph has been restarted; it does not include fine-grained failure recoveries.
As many users want to build their job alerting based on failovers, I propose adding a new metric {{numberOfRestarts}} that also accounts for fine-grained recoveries. The metric should be a meter (MeterView) so that users can leverage the rate to detect newly occurring failures rather than deriving it themselves.
The MeterView should be added in SchedulerBase to serve both the legacy scheduler and the ng scheduler.
The underlying counter of the MeterView is determined by the scheduler implementations:
1. for the legacy scheduler, it's the {{ExecutionGraph#numberOfRestartsCounter}} which was added in FLINK-14206
2. for the ng scheduler, it's a new counter added in {{ExecutionFailureHandler}} that counts all the task and global failures notified to it.

> Add a metric to show failover count regarding fine grained recovery
> -------------------------------------------------------------------
>
>                 Key: FLINK-14164
>                 URL: https://issues.apache.org/jira/browse/FLINK-14164
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Coordination, Runtime / Metrics
>    Affects Versions: 1.10.0
>            Reporter: Zhu Zhu
>            Assignee: Zhu Zhu
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.10.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Previously, Flink used the restart-all strategy to recover jobs from failures,
> and the metric "fullRestart" showed the count of failovers.
> However, with fine-grained recovery introduced in 1.9.0, the "fullRestart"
> metric only reveals how many times the entire graph has been restarted; it
> does not include fine-grained failure recoveries.
> As many users want to build their job alerting based on failovers, I propose
> adding a new metric {{numberOfRestarts}} that also accounts for fine-grained
> recoveries. The metric should be a Gauge.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
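The difference between "fullRestart" and the proposed {{numberOfRestarts}} gauge, as described in the message above, might look roughly like the following. This is only a minimal sketch under assumptions: `RestartTracker` and its methods are hypothetical names, not Flink classes, and `Supplier<Long>` stands in for a gauge. It also assumes that a full restart counts towards {{numberOfRestarts}} as well, which the description implies but does not spell out.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

/** Hypothetical stand-in for a scheduler's restart bookkeeping; not a Flink class. */
final class RestartTracker {
    private final AtomicLong fullRestarts = new AtomicLong();
    private final AtomicLong numberOfRestarts = new AtomicLong();

    /** A fine-grained recovery is visible only in numberOfRestarts. */
    void onFineGrainedRestart() {
        numberOfRestarts.incrementAndGet();
    }

    /** Assumption: a restart of the entire graph counts towards both metrics. */
    void onFullRestart() {
        fullRestarts.incrementAndGet();
        numberOfRestarts.incrementAndGet();
    }

    /** Gauge-like view: reports the current value each time it is read. */
    Supplier<Long> fullRestartsGauge() {
        return fullRestarts::get;
    }

    Supplier<Long> numberOfRestartsGauge() {
        return numberOfRestarts::get;
    }
}

final class RestartTrackerDemo {
    public static void main(String[] args) {
        RestartTracker tracker = new RestartTracker();
        tracker.onFineGrainedRestart();
        tracker.onFineGrainedRestart();
        tracker.onFullRestart();
        System.out.println("fullRestart=" + tracker.fullRestartsGauge().get()
                + " numberOfRestarts=" + tracker.numberOfRestartsGauge().get());
    }
}
```

With such a gauge, alerting on failovers would compare successive readings of {{numberOfRestarts}}; an alert keyed on "fullRestart" alone would miss the two fine-grained recoveries in the demo.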
[jira] [Updated] (FLINK-14164) Add a metric to show failover count regarding fine grained recovery
[ https://issues.apache.org/jira/browse/FLINK-14164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated FLINK-14164:
-----------------------------------
    Labels: pull-request-available  (was: )

> Add a metric to show failover count regarding fine grained recovery
> -------------------------------------------------------------------
>
>                 Key: FLINK-14164
>                 URL: https://issues.apache.org/jira/browse/FLINK-14164
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Coordination, Runtime / Metrics
>    Affects Versions: 1.10.0
>            Reporter: Zhu Zhu
>            Assignee: Zhu Zhu
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.10.0
>
>
> Previously, Flink used the restart-all strategy to recover jobs from failures,
> and the metric "fullRestart" showed the count of failovers.
> However, with fine-grained recovery introduced in 1.9.0, the "fullRestart"
> metric only reveals how many times the entire graph has been restarted; it
> does not include fine-grained failure recoveries.
> As many users want to build their job alerting based on failovers, I propose
> adding a new metric {{numberOfRestarts}} that also accounts for fine-grained
> recoveries. The metric should be a meter (MeterView) so that users can
> leverage the rate to detect newly occurring failures rather than deriving it
> themselves.
> The MeterView should be added in SchedulerBase to serve both the legacy
> scheduler and the ng scheduler.
> The underlying counter of the MeterView is determined by the scheduler
> implementations:
> 1. for the legacy scheduler, it's the {{ExecutionGraph#numberOfRestartsCounter}}
> which was added in FLINK-14206
> 2. for the ng scheduler, it's a new counter added in {{ExecutionFailureHandler}}
> that counts all the task and global failures notified to it.
[jira] [Updated] (FLINK-14164) Add a metric to show failover count regarding fine grained recovery
[ https://issues.apache.org/jira/browse/FLINK-14164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhu Zhu updated FLINK-14164:
----------------------------
    Description:
Previously, Flink used the restart-all strategy to recover jobs from failures, and the metric "fullRestart" showed the count of failovers.
However, with fine-grained recovery introduced in 1.9.0, the "fullRestart" metric only reveals how many times the entire graph has been restarted; it does not include fine-grained failure recoveries.
As many users want to build their job alerting based on failovers, I propose adding a new metric {{numberOfRestarts}} that also accounts for fine-grained recoveries. The metric should be a meter (MeterView) so that users can leverage the rate to detect newly occurring failures rather than deriving it themselves.
The MeterView should be added in SchedulerBase to serve both the legacy scheduler and the ng scheduler.
The underlying counter of the MeterView is determined by the scheduler implementations:
1. for the legacy scheduler, it's the {{ExecutionGraph#numberOfRestartsCounter}} which was added in FLINK-14206
2. for the ng scheduler, it's a new counter added in {{ExecutionFailureHandler}} that counts all the task and global failures notified to it.

  was:
Previously, Flink used the restart-all strategy to recover jobs from failures, and the metric "fullRestart" showed the count of failovers.
However, with fine-grained recovery introduced in 1.9.0, the "fullRestart" metric only reveals how many times the entire graph has been restarted; it does not include fine-grained failure recoveries.
As many users want to build their job alerting based on failovers, I propose adding a new metric {{numberOfRestarts}} that also accounts for fine-grained recoveries. The metric should be a MeterView so that users can leverage the rate to detect newly occurring failures rather than deriving it themselves.
The MeterView should be registered to SchedulerBase to serve both the legacy scheduler and the ng scheduler.
The underlying counter of the MeterView is determined by the scheduler implementations:
1. for the legacy scheduler, it's the {{ExecutionGraph#numberOfRestartsCounter}} which was added in FLINK-14206
2. for the ng scheduler, it's a new counter added in {{ExecutionFailureHandler}} that counts all the task and global failures notified to it.

> Add a metric to show failover count regarding fine grained recovery
> -------------------------------------------------------------------
>
>                 Key: FLINK-14164
>                 URL: https://issues.apache.org/jira/browse/FLINK-14164
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Coordination, Runtime / Metrics
>    Affects Versions: 1.10.0
>            Reporter: Zhu Zhu
>            Priority: Major
>             Fix For: 1.10.0
>
>
> Previously, Flink used the restart-all strategy to recover jobs from failures,
> and the metric "fullRestart" showed the count of failovers.
> However, with fine-grained recovery introduced in 1.9.0, the "fullRestart"
> metric only reveals how many times the entire graph has been restarted; it
> does not include fine-grained failure recoveries.
> As many users want to build their job alerting based on failovers, I propose
> adding a new metric {{numberOfRestarts}} that also accounts for fine-grained
> recoveries. The metric should be a meter (MeterView) so that users can
> leverage the rate to detect newly occurring failures rather than deriving it
> themselves.
> The MeterView should be added in SchedulerBase to serve both the legacy
> scheduler and the ng scheduler.
> The underlying counter of the MeterView is determined by the scheduler
> implementations:
> 1. for the legacy scheduler, it's the {{ExecutionGraph#numberOfRestartsCounter}}
> which was added in FLINK-14206
> 2. for the ng scheduler, it's a new counter added in {{ExecutionFailureHandler}}
> that counts all the task and global failures notified to it.
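The idea in the message above, a meter that derives a rate from an underlying restart counter so that users do not have to, can be illustrated with a simplified sketch. `RestartRateView` is a hypothetical class written for this note, not Flink's MeterView; it assumes that something external calls `update()` at a fixed interval, and the window and interval values in the demo are arbitrary.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical, simplified rate view over a restart counter; not Flink's MeterView. */
final class RestartRateView {
    private final AtomicLong counter;
    private final int updateIntervalSeconds;
    private final long[] samples; // ring buffer of counter snapshots
    private int index = 0;
    private double rate = 0.0;

    RestartRateView(AtomicLong counter, int timeSpanSeconds, int updateIntervalSeconds) {
        if (updateIntervalSeconds <= 0 || timeSpanSeconds < updateIntervalSeconds
                || timeSpanSeconds % updateIntervalSeconds != 0) {
            throw new IllegalArgumentException("time span must be a positive multiple of the interval");
        }
        this.counter = counter;
        this.updateIntervalSeconds = updateIntervalSeconds;
        // One extra slot so that the oldest and newest snapshots are timeSpanSeconds apart.
        this.samples = new long[timeSpanSeconds / updateIntervalSeconds + 1];
    }

    /** Assumed to be called once per update interval by some external timer. */
    void update() {
        index = (index + 1) % samples.length;
        samples[index] = counter.get();
        long oldest = samples[(index + 1) % samples.length];
        int spanSeconds = (samples.length - 1) * updateIntervalSeconds;
        rate = (double) (samples[index] - oldest) / spanSeconds;
    }

    /** Restarts per second over the configured time span. */
    double getRate() {
        return rate;
    }
}

final class RestartRateViewDemo {
    public static void main(String[] args) {
        // The counter stands in for whichever counter the scheduler implementation provides.
        AtomicLong restarts = new AtomicLong();
        RestartRateView view = new RestartRateView(restarts, 10, 5);
        view.update();
        restarts.addAndGet(2); // two failovers between updates
        view.update();
        System.out.println("restarts per second: " + view.getRate());
    }
}
```

A non-zero rate signals that failovers happened within the window, and the rate falls back to zero once the window has moved past them. This is the property the description wants users to be able to alert on without comparing raw counts themselves.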
[jira] [Updated] (FLINK-14164) Add a metric to show failover count regarding fine grained recovery
[ https://issues.apache.org/jira/browse/FLINK-14164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhu Zhu updated FLINK-14164:
----------------------------
    Description:
Previously, Flink used the restart-all strategy to recover jobs from failures, and the metric "fullRestart" showed the count of failovers.
However, with fine-grained recovery introduced in 1.9.0, the "fullRestart" metric only reveals how many times the entire graph has been restarted; it does not include fine-grained failure recoveries.
As many users want to build their job alerting based on failovers, I propose adding a new metric {{numberOfRestarts}} that also accounts for fine-grained recoveries. The metric should be a MeterView so that users can leverage the rate to detect newly occurring failures rather than deriving it themselves.
The MeterView should be registered to SchedulerBase to serve both the legacy scheduler and the ng scheduler.
The underlying counter of the MeterView is determined by the scheduler implementations:
1. for the legacy scheduler, it's the {{ExecutionGraph#numberOfRestartsCounter}} which was added in FLINK-14206
2. for the ng scheduler, it's a new counter added in {{ExecutionFailureHandler}} that counts all the task and global failures notified to it.

  was:
Previously, Flink used the restart-all strategy to recover jobs from failures, and the metric "fullRestart" showed the count of failovers.
However, with fine-grained recovery introduced in 1.9.0, the "fullRestart" metric only reveals how many times the entire graph has been restarted; it does not include fine-grained failure recoveries.
As many users want to build their job alerting based on failovers, I propose adding a new metric {{numberOfFailures}}/{{numberOfRestarts}} that also accounts for fine-grained recoveries.

> Add a metric to show failover count regarding fine grained recovery
> -------------------------------------------------------------------
>
>                 Key: FLINK-14164
>                 URL: https://issues.apache.org/jira/browse/FLINK-14164
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Coordination, Runtime / Metrics
>    Affects Versions: 1.10.0
>            Reporter: Zhu Zhu
>            Priority: Major
>             Fix For: 1.10.0
>
>
> Previously, Flink used the restart-all strategy to recover jobs from failures,
> and the metric "fullRestart" showed the count of failovers.
> However, with fine-grained recovery introduced in 1.9.0, the "fullRestart"
> metric only reveals how many times the entire graph has been restarted; it
> does not include fine-grained failure recoveries.
> As many users want to build their job alerting based on failovers, I propose
> adding a new metric {{numberOfRestarts}} that also accounts for fine-grained
> recoveries. The metric should be a MeterView so that users can leverage the
> rate to detect newly occurring failures rather than deriving it themselves.
> The MeterView should be registered to SchedulerBase to serve both the legacy
> scheduler and the ng scheduler.
> The underlying counter of the MeterView is determined by the scheduler
> implementations:
> 1. for the legacy scheduler, it's the {{ExecutionGraph#numberOfRestartsCounter}}
> which was added in FLINK-14206
> 2. for the ng scheduler, it's a new counter added in {{ExecutionFailureHandler}}
> that counts all the task and global failures notified to it.
[jira] [Updated] (FLINK-14164) Add a metric to show failover count regarding fine grained recovery
[ https://issues.apache.org/jira/browse/FLINK-14164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhu Zhu updated FLINK-14164:
----------------------------
        Parent: FLINK-10429
    Issue Type: Sub-task  (was: Improvement)

> Add a metric to show failover count regarding fine grained recovery
> -------------------------------------------------------------------
>
>                 Key: FLINK-14164
>                 URL: https://issues.apache.org/jira/browse/FLINK-14164
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Coordination, Runtime / Metrics
>    Affects Versions: 1.10.0
>            Reporter: Zhu Zhu
>            Priority: Major
>             Fix For: 1.10.0
>
>
> Previously, Flink used the restart-all strategy to recover jobs from failures,
> and the metric "fullRestart" showed the count of failovers.
> However, with fine-grained recovery introduced in 1.9.0, the "fullRestart"
> metric only reveals how many times the entire graph has been restarted; it
> does not include fine-grained failure recoveries.
> As many users want to build their job alerting based on failovers, I propose
> adding a new metric {{numberOfFailures}}/{{numberOfRestarts}} that also
> accounts for fine-grained recoveries.
[jira] [Updated] (FLINK-14164) Add a metric to show failover count regarding fine grained recovery
[ https://issues.apache.org/jira/browse/FLINK-14164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhu Zhu updated FLINK-14164:
----------------------------
    Affects Version/s:     (was: 1.9.0)

> Add a metric to show failover count regarding fine grained recovery
> -------------------------------------------------------------------
>
>                 Key: FLINK-14164
>                 URL: https://issues.apache.org/jira/browse/FLINK-14164
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination, Runtime / Metrics
>    Affects Versions: 1.10.0
>            Reporter: Zhu Zhu
>            Priority: Major
>             Fix For: 1.10.0
>
>
> Previously, Flink used the restart-all strategy to recover jobs from failures,
> and the metric "fullRestart" showed the count of failovers.
> However, with fine-grained recovery introduced in 1.9.0, the "fullRestart"
> metric only reveals how many times the entire graph has been restarted; it
> does not include fine-grained failure recoveries.
> As many users want to build their job alerting based on failovers, I propose
> adding a new metric {{numberOfFailures}}/{{numberOfRestarts}} that also
> accounts for fine-grained recoveries.