Zhu Zhu, that is correct.

On Tue, Sep 24, 2019 at 8:04 PM Zhu Zhu <reed...@gmail.com> wrote:
> Hi Steven,
>
> To conclude: since we will have a meter metric[1] for restarts, a
> customized restart strategy is not needed in your case.
> Is that right?
>
> [1] https://issues.apache.org/jira/browse/FLINK-14164
>
> Thanks,
> Zhu Zhu
>
> On Wed, Sep 25, 2019 at 2:30 AM Steven Wu <stevenz...@gmail.com> wrote:
>
>> Zhu Zhu,
>>
>> Sorry, I was using different terminology. Yes, the Flink meter is what I
>> was talking about regarding "fullRestarts" for threshold-based alerting.
>>
>> On Mon, Sep 23, 2019 at 7:46 PM Zhu Zhu <reed...@gmail.com> wrote:
>>
>>> Steven,
>>>
>>> As I understand it, a Flink counter only stores its accumulated count
>>> and reports that value. Are you using an external counter directly?
>>> Maybe Flink Meter/MeterView is what you need? It stores the count and
>>> calculates the rate, and it reports both its "count" and its "rate" to
>>> external metric services.
>>>
>>> The counter "task_failures" only works if the individual failover
>>> strategy is enabled. However, it is not a public interface and its use
>>> is not recommended, as fine-grained recovery (region failover) now
>>> supersedes it.
>>> I've opened a ticket[1] to add a metric that shows failovers and
>>> respects fine-grained recovery.
>>>
>>> [1] https://issues.apache.org/jira/browse/FLINK-14164
>>>
>>> Thanks,
>>> Zhu Zhu
>>>
>>> On Tue, Sep 24, 2019 at 6:41 AM Steven Wu <stevenz...@gmail.com> wrote:
>>>
>>>> When we set up an alert like "fullRestarts > 1" over some rolling
>>>> window, we want to use a counter. If it is a Gauge, "fullRestarts"
>>>> will never go below 1 after the first full restart, so the alert
>>>> condition will always be true after the first job restart. If we can
>>>> apply a derivative to the Gauge value, I guess the alert can probably
>>>> work. I can explore whether that is an option.
>>>>
>>>> Yeah, understood that "fullRestart" won't increment when fine-grained
>>>> recovery happens. I think a "task_failures" counter already exists in
>>>> Flink.
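The "derivative of the Gauge" idea mentioned above can be sketched in plain Java. This is only an illustration under stated assumptions: the class `RollingRestartAlert`, its `record` method, and the window and threshold values are hypothetical names chosen for this example and are not part of Flink or of any metric reporter. The sketch keeps recent readings of a cumulative value and alerts on the increase inside the rolling window, so the alert clears once restarts stop even though the cumulative value never decreases.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical helper (not a Flink class): alerts on the growth of a cumulative gauge within a rolling window. */
class RollingRestartAlert {
    /** One reading of the cumulative gauge. */
    private static final class Sample {
        final long timestampMs;
        final long value;

        Sample(long timestampMs, long value) {
            this.timestampMs = timestampMs;
            this.value = value;
        }
    }

    private final Deque<Sample> samples = new ArrayDeque<>();
    private final long windowMs;
    private final long threshold;

    RollingRestartAlert(long windowMs, long threshold) {
        this.windowMs = windowMs;
        this.threshold = threshold;
    }

    /** Records one gauge reading and returns true if the increase inside the window exceeds the threshold. */
    boolean record(long timestampMs, long cumulativeRestarts) {
        samples.addLast(new Sample(timestampMs, cumulativeRestarts));

        // Drop readings that are older than the start of the rolling window.
        long windowStart = timestampMs - windowMs;
        while (samples.peekFirst().timestampMs < windowStart) {
            samples.removeFirst();
        }

        // Increase since the oldest retained reading; clamp at 0 in case the gauge was reset.
        long delta = Math.max(0L, cumulativeRestarts - samples.peekFirst().value);
        return delta > threshold;
    }
}
```

With a 60-second window and a threshold of 1, readings of 0, 1, and 3 at 0 s, 30 s, and 50 s fire the alert only on the third reading. A later reading of 3 at 200 s does not fire, whereas a raw "gauge > 1" check would stay true indefinitely.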
>>>>
>>>> On Sun, Sep 22, 2019 at 7:59 PM Zhu Zhu <reed...@gmail.com> wrote:
>>>>
>>>>> Steven,
>>>>>
>>>>> Thanks for the information. If we can determine that this is a
>>>>> common issue, we can solve it in Flink core.
>>>>> To get to that state, I have two questions which need your help:
>>>>> 1. Why is a gauge not good for alerting? The metric "fullRestart" is
>>>>> a Gauge<Long>. Does the metric reporter you use report Counter and
>>>>> Gauge<Long> to external services in different ways? Or can anything
>>>>> else differ due to the metric type?
>>>>> 2. Is the "number of restarts" what you actually need, rather than
>>>>> the "fullRestart" count? If so, I believe we will have such a counter
>>>>> metric in 1.10, since the previous "fullRestart" metric value is not
>>>>> the number of restarts when fine-grained recovery (a feature added in
>>>>> 1.9.0) is enabled.
>>>>> "fullRestart" reveals how many times the entire job graph has been
>>>>> restarted. If fine-grained recovery is enabled, the graph would not
>>>>> be restarted when task failures happen, and the "fullRestart" value
>>>>> will not increment in such cases.
>>>>>
>>>>> I'd appreciate it if you could help with these questions so that we
>>>>> can make better decisions for Flink.
>>>>>
>>>>> Thanks,
>>>>> Zhu Zhu
>>>>>
>>>>> On Sun, Sep 22, 2019 at 3:31 AM Steven Wu <stevenz...@gmail.com> wrote:
>>>>>
>>>>>> Zhu Zhu,
>>>>>>
>>>>>> The Flink fullRestart metric is a Gauge, which is not good for
>>>>>> alerting on. We publish an equivalent Counter metric for alerting
>>>>>> purposes.
>>>>>>
>>>>>> Thanks,
>>>>>> Steven
>>>>>>
>>>>>> On Thu, Sep 19, 2019 at 7:45 PM Zhu Zhu <reed...@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks Steven for the feedback!
>>>>>>> Could you share more information about the metrics you add in your
>>>>>>> customized restart strategy?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Zhu Zhu
>>>>>>>
>>>>>>> On Fri, Sep 20, 2019 at 7:11 AM Steven Wu <stevenz...@gmail.com> wrote:
>>>>>>>
>>>>>>>> We do use a config like "restart-strategy:
>>>>>>>> org.foobar.MyRestartStrategyFactoryFactory", mainly to add metrics
>>>>>>>> beyond the ones Flink provides.
>>>>>>>>
>>>>>>>> On Thu, Sep 19, 2019 at 4:50 AM Zhu Zhu <reed...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Thanks everyone for the input.
>>>>>>>>>
>>>>>>>>> RestartStrategy customization is not recognized as a public
>>>>>>>>> interface, as it is not explicitly documented.
>>>>>>>>> Since the feedback from this survey indicates that it is not in
>>>>>>>>> use, I'll conclude that we do not need to support customized
>>>>>>>>> RestartStrategy for the new scheduler in Flink 1.10.
>>>>>>>>>
>>>>>>>>> Other usages are still supported, including all the strategies
>>>>>>>>> and the ways of configuring them described in
>>>>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/task_failure_recovery.html#restart-strategies
>>>>>>>>>
>>>>>>>>> Feel free to share in this thread if you have any concerns about
>>>>>>>>> this.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Zhu Zhu
>>>>>>>>>
>>>>>>>>> On Thu, Sep 12, 2019 at 10:33 PM Zhu Zhu <reed...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks Oytun for the reply!
>>>>>>>>>>
>>>>>>>>>> Sorry for not having stated it clearly. By "customized
>>>>>>>>>> RestartStrategy", we mean that users implement an
>>>>>>>>>> *org.apache.flink.runtime.executiongraph.restart.RestartStrategy*
>>>>>>>>>> themselves and use it by configuring something like
>>>>>>>>>> "restart-strategy: org.foobar.MyRestartStrategyFactoryFactory".
>>>>>>>>>>
>>>>>>>>>> The usage of restart strategies you mentioned will keep working
>>>>>>>>>> with the new scheduler.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Zhu Zhu
>>>>>>>>>>
>>>>>>>>>> On Thu, Sep 12, 2019 at 10:05 PM Oytun Tez <oy...@motaword.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Zhu,
>>>>>>>>>>>
>>>>>>>>>>> We are using a custom restart strategy like this:
>>>>>>>>>>>
>>>>>>>>>>> environment.setRestartStrategy(failureRateRestart(2, Time.minutes(1), Time.minutes(10)));
>>>>>>>>>>>
>>>>>>>>>>> ---
>>>>>>>>>>> Oytun Tez
>>>>>>>>>>>
>>>>>>>>>>> *M O T A W O R D*
>>>>>>>>>>> The World's Fastest Human Translation Platform.
>>>>>>>>>>> oy...@motaword.com - www.motaword.com
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Sep 12, 2019 at 7:11 AM Zhu Zhu <reed...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>
>>>>>>>>>>>> I wanted to reach out to you and ask how many of you are using
>>>>>>>>>>>> a customized RestartStrategy[1] in production jobs.
>>>>>>>>>>>>
>>>>>>>>>>>> We are currently developing the new Flink scheduler[2], which
>>>>>>>>>>>> interacts with restart strategies in a different way. We have
>>>>>>>>>>>> to redesign the interfaces for the new restart strategies (the
>>>>>>>>>>>> so-called RestartBackoffTimeStrategy). Existing customized
>>>>>>>>>>>> RestartStrategy implementations will no longer work with the
>>>>>>>>>>>> new scheduler.
>>>>>>>>>>>>
>>>>>>>>>>>> We want to know whether we should keep a way to customize
>>>>>>>>>>>> RestartBackoffTimeStrategy so that existing customized
>>>>>>>>>>>> RestartStrategy implementations can be migrated.
>>>>>>>>>>>>
>>>>>>>>>>>> I'd appreciate it if you could share your status if you are
>>>>>>>>>>>> using a customized RestartStrategy. That will be valuable for
>>>>>>>>>>>> us in making decisions.
>>>>>>>>>>>>
>>>>>>>>>>>> [1]
>>>>>>>>>>>> https://ci.apache.org/projects/flink/flink-docs-master/dev/task_failure_recovery.html#restart-strategies
>>>>>>>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-10429
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Zhu Zhu
>>>>>>>>>>>>
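For readers unfamiliar with the failure-rate strategy quoted in this thread (`failureRateRestart(2, Time.minutes(1), Time.minutes(10))`), the following plain-Java sketch models the general idea: restarts are allowed until too many failures fall inside one time interval, after which the job is failed. It is a simplified model under stated assumptions, not Flink's implementation. The class `FailureRateModel` and its `onFailure` method are hypothetical names introduced for this example, and the restart delay (the third argument in the quoted call) is not modeled.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical, simplified model of a failure-rate restart policy (not Flink code). */
class FailureRateModel {
    private final int maxFailuresPerInterval;
    private final long intervalMs;
    // Timestamps of the restarts that were granted, oldest first.
    private final Deque<Long> restartTimestamps = new ArrayDeque<>();

    FailureRateModel(int maxFailuresPerInterval, long intervalMs) {
        this.maxFailuresPerInterval = maxFailuresPerInterval;
        this.intervalMs = intervalMs;
    }

    /**
     * Handles a failure observed at nowMs.
     * Returns true if a restart is granted (and recorded), false if the job should fail.
     */
    boolean onFailure(long nowMs) {
        // Forget restarts that happened longer ago than one interval.
        while (!restartTimestamps.isEmpty() && nowMs - restartTimestamps.peekFirst() > intervalMs) {
            restartTimestamps.removeFirst();
        }
        // Too many restarts already granted inside the interval: give up.
        if (restartTimestamps.size() >= maxFailuresPerInterval) {
            return false;
        }
        restartTimestamps.addLast(nowMs);
        return true;
    }
}
```

With a limit of 2 failures per 60 seconds, failures at 0 s and 10 s are both granted a restart, and a third failure at 20 s is refused. If the third failure instead arrives at 100 s, the earlier restarts have aged out of the interval and the restart is granted.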