Steven,

As I understand it, a Flink Counter only stores its accumulated count and reports that value. Are you using an external counter directly?

Maybe Flink's Meter/MeterView is what you need? It stores the count and calculates the rate, and it reports both its "count" and its "rate" to external metric services.
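To illustrate the difference between an accumulated count and a meter's rate, here is a minimal sketch in plain Java. It is not Flink's MeterView code: the `MeterSketch` class, its ring-buffer layout, and the `windowSlots` and `intervalSeconds` parameters are all hypothetical names chosen for this example, and it assumes `update()` is called once per interval.

```java
// Hypothetical sketch (not Flink's MeterView): a meter that keeps an
// accumulated count and derives a per-second rate from periodic snapshots.
class MeterSketch {
    private final long[] snapshots;   // ring buffer of counts seen at each update()
    private final int windowSeconds;  // time span covered by the ring buffer
    private int index = 0;            // slot of the most recent snapshot
    private long count = 0;           // accumulated count, never reset
    private double rate = 0.0;        // events per second over the window

    // windowSlots and intervalSeconds are assumed parameters for this sketch:
    // update() is expected to be called once every intervalSeconds.
    MeterSketch(int windowSlots, int intervalSeconds) {
        this.snapshots = new long[windowSlots + 1];
        this.windowSeconds = windowSlots * intervalSeconds;
    }

    // Record one event, e.g. one restart.
    void markEvent() { count++; }

    // Take a snapshot and recompute the rate against the oldest snapshot.
    void update() {
        index = (index + 1) % snapshots.length;
        snapshots[index] = count;
        long oldest = snapshots[(index + 1) % snapshots.length];
        rate = (double) (count - oldest) / windowSeconds;
    }

    long getCount() { return count; }
    double getRate() { return rate; }

    public static void main(String[] args) {
        MeterSketch restarts = new MeterSketch(2, 5); // 10-second window
        for (int i = 0; i < 3; i++) restarts.markEvent();
        restarts.update();
        System.out.println("count=" + restarts.getCount() + " rate=" + restarts.getRate());
        // No further events: two more intervals pass.
        restarts.update();
        restarts.update();
        System.out.println("count=" + restarts.getCount() + " rate=" + restarts.getRate());
    }
}
```

In this sketch the count stays at 3 after the events stop, while the rate falls back to 0 once the events age out of the window. That is the property a rolling-window alert needs and a monotonically growing value alone does not provide.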
The counter "task_failures" only works if the individual failover strategy is enabled. However, it is not a public interface and is not recommended for use, as fine-grained recovery (region failover) now supersedes it.

I've opened a ticket [1] to add a metric that shows failovers and respects fine-grained recovery.

[1] https://issues.apache.org/jira/browse/FLINK-14164

Thanks,
Zhu Zhu

Steven Wu <stevenz...@gmail.com> wrote on Tue, Sep 24, 2019 at 6:41 AM:

> When we set up an alert like "fullRestarts > 1" for some rolling window, we
> want to use a counter. If it is a Gauge, "fullRestarts" will never go below 1
> after the first full restart, so the alert condition will always be true after
> the first job restart. If we can apply a derivative to the Gauge value, I guess
> the alert can probably work. I can explore whether that is an option.
>
> Yeah, understood that "fullRestart" won't increment when fine-grained
> recovery happens. I think a "task_failures" counter already exists in Flink.
>
> On Sun, Sep 22, 2019 at 7:59 PM Zhu Zhu <reed...@gmail.com> wrote:
>
>> Steven,
>>
>> Thanks for the information. If we can determine that this is a common issue, we
>> can solve it in Flink core.
>> To get to that state, I have two questions which need your help:
>> 1. Why is a gauge not good for alerting? The metric "fullRestart" is a
>> Gauge<Long>. Does the metric reporter you use report Counter and
>> Gauge<Long> to external services in different ways? Or can anything else
>> differ due to the metric type?
>> 2. Is the "number of restarts" what you actually need, rather than
>> the "fullRestart" count? If so, I believe we will have such a counter
>> metric in 1.10, since the previous "fullRestart" metric value is not the
>> number of restarts when fine-grained recovery (a feature added in 1.9.0)
>> is enabled.
>> "fullRestart" reveals how many times the entire job graph has been
>> restarted.
>> If fine-grained recovery (a feature added in 1.9.0) is enabled, the graph
>> would not be restarted when task failures happen, and the "fullRestart"
>> value will not increment in such cases.
>>
>> I'd appreciate it if you could help with these questions so that we can make
>> better decisions for Flink.
>>
>> Thanks,
>> Zhu Zhu
>>
>> Steven Wu <stevenz...@gmail.com> wrote on Sun, Sep 22, 2019 at 3:31 AM:
>>
>>> Zhu Zhu,
>>>
>>> Flink's fullRestart metric is a Gauge, which is not good for alerting on.
>>> We publish an equivalent Counter metric for alerting purposes.
>>>
>>> Thanks,
>>> Steven
>>>
>>> On Thu, Sep 19, 2019 at 7:45 PM Zhu Zhu <reed...@gmail.com> wrote:
>>>
>>>> Thanks Steven for the feedback!
>>>> Could you share more information about the metrics you add in your
>>>> customized restart strategy?
>>>>
>>>> Thanks,
>>>> Zhu Zhu
>>>>
>>>> Steven Wu <stevenz...@gmail.com> wrote on Fri, Sep 20, 2019 at 7:11 AM:
>>>>
>>>>> We do use config like "restart-strategy:
>>>>> org.foobar.MyRestartStrategyFactoryFactory", mainly to add
>>>>> metrics beyond the ones Flink provides.
>>>>>
>>>>> On Thu, Sep 19, 2019 at 4:50 AM Zhu Zhu <reed...@gmail.com> wrote:
>>>>>
>>>>>> Thanks everyone for the input.
>>>>>>
>>>>>> The RestartStrategy customization is not recognized as a public
>>>>>> interface as it is not explicitly documented.
>>>>>> As the feedback from this survey indicates that it is not used, I'll
>>>>>> conclude that we do not need to support customized RestartStrategy for
>>>>>> the new scheduler in Flink 1.10.
>>>>>>
>>>>>> Other usages are still supported, including all the strategies and
>>>>>> configuration methods described in
>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/task_failure_recovery.html#restart-strategies
>>>>>>
>>>>>> Feel free to share in this thread if you have any concerns about it.
>>>>>>
>>>>>> Thanks,
>>>>>> Zhu Zhu
>>>>>>
>>>>>> Zhu Zhu <reed...@gmail.com> wrote on Thu, Sep 12, 2019 at 10:33 PM:
>>>>>>
>>>>>>> Thanks Oytun for the reply!
>>>>>>>
>>>>>>> Sorry for not having stated it clearly. By "customized
>>>>>>> RestartStrategy", we mean that users implement an
>>>>>>> *org.apache.flink.runtime.executiongraph.restart.RestartStrategy*
>>>>>>> themselves and use it through configuration like "restart-strategy:
>>>>>>> org.foobar.MyRestartStrategyFactoryFactory".
>>>>>>>
>>>>>>> The usage of restart strategies you mentioned will keep working with
>>>>>>> the new scheduler.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Zhu Zhu
>>>>>>>
>>>>>>> Oytun Tez <oy...@motaword.com> wrote on Thu, Sep 12, 2019 at 10:05 PM:
>>>>>>>
>>>>>>>> Hi Zhu,
>>>>>>>>
>>>>>>>> We are using a custom restart strategy like this:
>>>>>>>>
>>>>>>>> environment.setRestartStrategy(failureRateRestart(2,
>>>>>>>> Time.minutes(1), Time.minutes(10)));
>>>>>>>>
>>>>>>>> ---
>>>>>>>> Oytun Tez
>>>>>>>>
>>>>>>>> *M O T A W O R D*
>>>>>>>> The World's Fastest Human Translation Platform.
>>>>>>>> oy...@motaword.com — www.motaword.com
>>>>>>>>
>>>>>>>> On Thu, Sep 12, 2019 at 7:11 AM Zhu Zhu <reed...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi everyone,
>>>>>>>>>
>>>>>>>>> I wanted to reach out to you and ask how many of you are using a
>>>>>>>>> customized RestartStrategy[1] in production jobs.
>>>>>>>>>
>>>>>>>>> We are currently developing the new Flink scheduler[2], which
>>>>>>>>> interacts with restart strategies in a different way. We have to
>>>>>>>>> redesign the interfaces for the new restart strategies (so-called
>>>>>>>>> RestartBackoffTimeStrategy). Existing customized RestartStrategy
>>>>>>>>> implementations will no longer work with the new scheduler.
>>>>>>>>>
>>>>>>>>> We want to know whether we should keep a way
>>>>>>>>> to customize RestartBackoffTimeStrategy so that existing customized
>>>>>>>>> RestartStrategy implementations can be migrated.
>>>>>>>>>
>>>>>>>>> I'd appreciate it if you could share your status if you are
>>>>>>>>> using a customized RestartStrategy. That will be valuable for us in
>>>>>>>>> making decisions.
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>> https://ci.apache.org/projects/flink/flink-docs-master/dev/task_failure_recovery.html#restart-strategies
>>>>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-10429
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Zhu Zhu
>>>>>>>>>
>>>>>>>>