[ 
https://issues.apache.org/jira/browse/FLINK-8043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Zhen Wu updated FLINK-8043:
----------------------------------
    Description: 
Fine-grained recovery publishes fullRestarts as a gauge, which is not suitable 
for threshold-based alerting. Usually we would alert on something like 
"fullRestarts > 0 happens 10 times in the last 15 minutes".

In comparison, "task_failures" is published as a counter.

  was: When fine-grained recovery fails (e.g. because there are not enough 
TaskManager slots when a replacement TaskManager node does not come back in 
time), Flink reverts to a full job restart. In this case, it should also 
increment the "job restart" metric.

        Summary: change fullRestarts (for fine-grained recovery) from gauge to 
counter  (was: increment job restart metric when fine-grained recovery reverted 
to full job restart)

> change fullRestarts (for fine-grained recovery) from gauge to counter
> ---------------------------------------------------------------------
>
>                 Key: FLINK-8043
>                 URL: https://issues.apache.org/jira/browse/FLINK-8043
>             Project: Flink
>          Issue Type: Bug
>          Components: ResourceManager
>    Affects Versions: 1.3.2
>            Reporter: Steven Zhen Wu
>
> Fine-grained recovery publishes fullRestarts as a gauge, which is not 
> suitable for threshold-based alerting. Usually we would alert on something 
> like "fullRestarts > 0 happens 10 times in the last 15 minutes".
> In comparison, "task_failures" is published as a counter.
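
To illustrate why the metric type matters for the alert rule quoted above, 
here is a minimal sketch in Python. It assumes (this is not stated in the 
issue) that the metrics backend reports a counter as a per-interval delta, 
while a gauge carries the cumulative value the job currently holds. The sample 
data and the names restarts_per_minute, minutes_above_zero and THRESHOLD are 
hypothetical and are not Flink APIs.

```python
# Hypothetical per-minute observations over a 15-minute window.
# Exactly one full restart happens, in minute 2. This is made-up data.
restarts_per_minute = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


def cumulative(values):
    """Running total, i.e. what a gauge holding a lifetime count would show."""
    total = 0
    out = []
    for v in values:
        total += v
        out.append(total)
    return out


def minutes_above_zero(samples):
    """Number of samples in the window whose value is > 0."""
    return sum(1 for s in samples if s > 0)


# Gauge: each sample is the cumulative number of restarts so far.
gauge_samples = cumulative(restarts_per_minute)

# Counter: each sample is the increase during that minute
# (assumed backend behavior, see the note above).
counter_samples = restarts_per_minute

# Alert rule from the issue: "> 0 happens 10 times in the last 15 minutes".
THRESHOLD = 10
gauge_alert = minutes_above_zero(gauge_samples) >= THRESHOLD
counter_alert = minutes_above_zero(counter_samples) >= THRESHOLD

print("gauge alert fires:  ", gauge_alert)
print("counter alert fires:", counter_alert)
```

Under these assumptions the gauge stays above zero for every minute after the 
single restart, so the rule fires even though only one restart occurred. The 
counter is above zero only in the minute of the restart, so the rule stays 
quiet until restarts actually recur.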



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
