When we set up an alert like "fullRestarts > 1" over some rolling window, we want to use a Counter. If it is a Gauge, "fullRestarts" will never go below 1 after the first full restart, so the alert condition will always be true after the first job restart. If we can apply a derivative to the Gauge value, I guess the alert can probably work. I can explore whether that is an option.
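To make the derivative idea concrete, here is a minimal sketch, assuming the alerting side can see timestamped samples of the cumulative gauge value. The class RollingIncrease and its methods record() and increase() are hypothetical names for illustration only; they are not part of Flink or of any metrics backend.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical helper (not a Flink API): turns samples of a cumulative
 * gauge such as "fullRestarts" into the increase over a rolling window,
 * so that an alert like "increase > 1" can clear again.
 */
final class RollingIncrease {

    /** One observation of the gauge at a point in time. */
    private static final class Sample {
        final long timestampMs;
        final long value;

        Sample(long timestampMs, long value) {
            this.timestampMs = timestampMs;
            this.value = value;
        }
    }

    private final long windowMs;
    // Samples strictly inside the current window, oldest first.
    private final Deque<Sample> samples = new ArrayDeque<>();
    // Most recent sample at or before the window start, if any was seen.
    private Sample baseline;

    RollingIncrease(long windowMs) {
        if (windowMs <= 0) {
            throw new IllegalArgumentException("windowMs must be positive");
        }
        this.windowMs = windowMs;
    }

    /** Records a gauge sample; timestamps are assumed to be non-decreasing. */
    void record(long timestampMs, long value) {
        samples.addLast(new Sample(timestampMs, value));
        long cutoff = timestampMs - windowMs;
        // Move samples that fell out of the window into the baseline slot.
        while (samples.peekFirst().timestampMs <= cutoff) {
            baseline = samples.removeFirst();
        }
    }

    /** Increase of the gauge over the window ending at the latest sample. */
    long increase() {
        if (samples.isEmpty()) {
            return 0L;
        }
        long latest = samples.peekLast().value;
        long start = (baseline != null) ? baseline.value : samples.peekFirst().value;
        // Assumption: a drop in the gauge means it was reset; report 0 rather than a negative value.
        return Math.max(0L, latest - start);
    }
}
```

With a 10-minute window, the alert condition would become increase() > 1 instead of a comparison against the raw gauge, so it clears once no further restarts land inside the window.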
Yeah, understood that "fullRestart" won't increment when fine-grained recovery happens. I think a "task_failures" counter already exists in Flink.

On Sun, Sep 22, 2019 at 7:59 PM Zhu Zhu <reed...@gmail.com> wrote:

> Steven,
>
> Thanks for the information. If we can determine that this is a common
> issue, we can solve it in Flink core.
> To get to that state, I have two questions which need your help:
> 1. Why is a gauge not good for alerting? The metric "fullRestart" is a
> Gauge<Long>. Does the metric reporter you use report Counter and
> Gauge<Long> to external services in different ways? Or can anything else
> be different due to the metric type?
> 2. Is the "number of restarts" what you actually need, rather than
> the "fullRestart" count? If so, I believe we will have such a counter
> metric in 1.10, since the previous "fullRestart" metric value is not the
> number of restarts when fine-grained recovery (feature added in 1.9.0) is
> enabled.
> "fullRestart" reveals how many times the entire job graph has been
> restarted. If fine-grained recovery (feature added in 1.9.0) is enabled,
> the graph would not be restarted when task failures happen and the
> "fullRestart" value will not increment in such cases.
>
> I'd appreciate it if you can help with these questions so that we can make
> better decisions for Flink.
>
> Thanks,
> Zhu Zhu
>
> On Sun, Sep 22, 2019 at 3:31 AM Steven Wu <stevenz...@gmail.com> wrote:
>
>> Zhu Zhu,
>>
>> Flink fullRestart metric is a Gauge, which is not good for alerting on.
>> We publish an equivalent Counter metric for alerting purposes.
>>
>> Thanks,
>> Steven
>>
>> On Thu, Sep 19, 2019 at 7:45 PM Zhu Zhu <reed...@gmail.com> wrote:
>>
>>> Thanks Steven for the feedback!
>>> Could you share more information about the metrics you add in your
>>> customized restart strategy?
>>>
>>> Thanks,
>>> Zhu Zhu
>>>
>>> On Fri, Sep 20, 2019 at 7:11 AM Steven Wu <stevenz...@gmail.com> wrote:
>>>
>>>> We do use config like "restart-strategy:
>>>> org.foobar.MyRestartStrategyFactoryFactory". Mainly to add metrics
>>>> beyond the ones Flink provides.
>>>>
>>>> On Thu, Sep 19, 2019 at 4:50 AM Zhu Zhu <reed...@gmail.com> wrote:
>>>>
>>>>> Thanks everyone for the input.
>>>>>
>>>>> The RestartStrategy customization is not recognized as a public
>>>>> interface as it is not explicitly documented.
>>>>> As the feedback from this survey indicates it is not used, I'll
>>>>> conclude that we do not need to support customized RestartStrategy
>>>>> for the new scheduler in Flink 1.10.
>>>>>
>>>>> Other usages are still supported, including all the strategies and
>>>>> ways of configuring them described in
>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/task_failure_recovery.html#restart-strategies
>>>>> .
>>>>>
>>>>> Feel free to share in this thread if you have any concerns about it.
>>>>>
>>>>> Thanks,
>>>>> Zhu Zhu
>>>>>
>>>>> On Thu, Sep 12, 2019 at 10:33 PM Zhu Zhu <reed...@gmail.com> wrote:
>>>>>
>>>>>> Thanks Oytun for the reply!
>>>>>>
>>>>>> Sorry for not having stated it clearly. When saying "customized
>>>>>> RestartStrategy", we mean that users implement an
>>>>>> *org.apache.flink.runtime.executiongraph.restart.RestartStrategy* by
>>>>>> themselves and use it by configuring like "restart-strategy:
>>>>>> org.foobar.MyRestartStrategyFactoryFactory".
>>>>>>
>>>>>> The usage of restart strategies you mentioned will keep working with
>>>>>> the new scheduler.
>>>>>>
>>>>>> Thanks,
>>>>>> Zhu Zhu
>>>>>>
>>>>>> On Thu, Sep 12, 2019 at 10:05 PM Oytun Tez <oy...@motaword.com> wrote:
>>>>>>
>>>>>>> Hi Zhu,
>>>>>>>
>>>>>>> We are using a custom restart strategy like this:
>>>>>>>
>>>>>>> environment.setRestartStrategy(failureRateRestart(2,
>>>>>>> Time.minutes(1), Time.minutes(10)));
>>>>>>>
>>>>>>>
>>>>>>> ---
>>>>>>> Oytun Tez
>>>>>>>
>>>>>>> *M O T A W O R D*
>>>>>>> The World's Fastest Human Translation Platform.
>>>>>>> oy...@motaword.com - www.motaword.com
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Sep 12, 2019 at 7:11 AM Zhu Zhu <reed...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi everyone,
>>>>>>>>
>>>>>>>> I wanted to reach out to you and ask how many of you are using a
>>>>>>>> customized RestartStrategy[1] in production jobs.
>>>>>>>>
>>>>>>>> We are currently developing the new Flink scheduler[2], which
>>>>>>>> interacts with restart strategies in a different way. We have to
>>>>>>>> re-design the interfaces for the new restart strategies (so-called
>>>>>>>> RestartBackoffTimeStrategy). Existing customized RestartStrategy
>>>>>>>> implementations will no longer work with the new scheduler.
>>>>>>>>
>>>>>>>> We want to know whether we should keep a way to customize
>>>>>>>> RestartBackoffTimeStrategy so that existing customized
>>>>>>>> RestartStrategy implementations can be migrated.
>>>>>>>>
>>>>>>>> I'd appreciate it if you can share the status if you are using a
>>>>>>>> customized RestartStrategy. That will be valuable for us in making
>>>>>>>> decisions.
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> https://ci.apache.org/projects/flink/flink-docs-master/dev/task_failure_recovery.html#restart-strategies
>>>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-10429
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Zhu Zhu