Re: [SURVEY] How many people are using customized RestartStrategy(s)

Steven Wu Mon, 23 Sep 2019 15:42:36 -0700

When we set up an alert like "fullRestarts > 1" over some rolling window, we
want to use a counter. If it is a Gauge, "fullRestarts" will never go below 1
after the first full restart, so the alert condition will always be true after
the first job restart. If we can apply a derivative to the Gauge value, I guess
the alert can probably work. I can explore whether that is an option.
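
To illustrate the "derivative" idea, here is a minimal sketch in plain Java
with no Flink dependency. It assumes the gauge is sampled periodically and
that its value is cumulative. The class name GaugeWindowDelta and its method
are hypothetical, not part of Flink or of any metrics backend. In practice
the alerting system would normally compute this itself.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical helper (not a Flink API): turns samples of a cumulative
 * gauge such as "fullRestarts" into the increase seen in a rolling window.
 */
final class GaugeWindowDelta {

    /** One gauge reading together with the time it was taken. */
    private static final class Sample {
        final long timestampMillis;
        final long value;

        Sample(long timestampMillis, long value) {
            this.timestampMillis = timestampMillis;
            this.value = value;
        }
    }

    private final long windowMillis;
    private final Deque<Sample> samples = new ArrayDeque<>();

    GaugeWindowDelta(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    /**
     * Records a reading and returns how much the gauge grew between the
     * oldest sample still inside the window and this reading.
     */
    long record(long timestampMillis, long gaugeValue) {
        samples.addLast(new Sample(timestampMillis, gaugeValue));

        // Evict samples that fell out of the window, always keeping the newest one.
        long cutoff = timestampMillis - windowMillis;
        while (samples.size() > 1 && samples.peekFirst().timestampMillis < cutoff) {
            samples.removeFirst();
        }

        // Clamp at zero so a gauge reset (value going back down) is not
        // reported as a negative increase.
        return Math.max(0L, gaugeValue - samples.peekFirst().value);
    }
}
```

An alert condition like "increase > 1" on the returned value becomes false
again once the restarts have aged out of the window, which the raw Gauge
value cannot do.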


Yeah. Understood that "fullRestarts" won't increment when fine-grained
recovery happens. I think a "task_failures" counter already exists in Flink.



On Sun, Sep 22, 2019 at 7:59 PM Zhu Zhu <[email protected]> wrote:

> Steven,
>
> Thanks for the information. If we can determine that this is a common issue,
> we can solve it in Flink core.
> To get to that state, I have two questions which need your help:
> 1. Why is a gauge not good for alerting? The metric "fullRestart" is a
> Gauge<Long>. Does the metric reporter you use report Counter and
> Gauge<Long> to external services in different ways? Or can anything else
> differ due to the metric type?
> 2. Is the "number of restarts" what you actually need, rather than
> the "fullRestart" count? If so, I believe we will have such a counter
> metric in 1.10, since the existing "fullRestart" metric value is not the
> number of restarts when fine-grained recovery (a feature added in 1.9.0) is
> enabled.
>     "fullRestart" reveals how many times the entire job graph has been
> restarted. If fine-grained recovery is enabled, the graph is not restarted
> when task failures happen, and the "fullRestart" value will not increment
> in such cases.
>
> I'd appreciate it if you could help with these questions so that we can
> make better decisions for Flink.
>
> Thanks,
> Zhu Zhu
>
> On Sun, Sep 22, 2019 at 3:31 AM Steven Wu <[email protected]> wrote:
>
>> Zhu Zhu,
>>
>> Flink's fullRestart metric is a Gauge, which is not good for alerting.
>> We publish an equivalent Counter metric for alerting purposes.
>>
>> Thanks,
>> Steven
>>
>> On Thu, Sep 19, 2019 at 7:45 PM Zhu Zhu <[email protected]> wrote:
>>
>>> Thanks Steven for the feedback!
>>> Could you share more information about the metrics you add in your
>>> customized restart strategy?
>>>
>>> Thanks,
>>> Zhu Zhu
>>>
>>> On Fri, Sep 20, 2019 at 7:11 AM Steven Wu <[email protected]> wrote:
>>>
>>>> We do use a config like "restart-strategy:
>>>> org.foobar.MyRestartStrategyFactoryFactory", mainly to add metrics
>>>> beyond the ones Flink provides.
>>>>
>>>> On Thu, Sep 19, 2019 at 4:50 AM Zhu Zhu <[email protected]> wrote:
>>>>
>>>>> Thanks everyone for the input.
>>>>>
>>>>> RestartStrategy customization is not recognized as a public
>>>>> interface, as it is not explicitly documented.
>>>>> As the feedback from this survey indicates it is not used, I'll conclude
>>>>> that we do not need to support customized RestartStrategy for the new
>>>>> scheduler in Flink 1.10.
>>>>>
>>>>> Other usages are still supported, including all the strategies and
>>>>> the ways of configuring them described in
>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/task_failure_recovery.html#restart-strategies
>>>>> .
>>>>>
>>>>> Feel free to share in this thread if you have any concerns about it.
>>>>>
>>>>> Thanks,
>>>>> Zhu Zhu
>>>>>
>>>>> On Thu, Sep 12, 2019 at 10:33 PM Zhu Zhu <[email protected]> wrote:
>>>>>
>>>>>> Thanks Oytun for the reply!
>>>>>>
>>>>>> Sorry for not having stated it clearly. By "customized
>>>>>> RestartStrategy", we mean that users implement an
>>>>>> *org.apache.flink.runtime.executiongraph.restart.RestartStrategy*
>>>>>> themselves and use it via a configuration like "restart-strategy:
>>>>>> org.foobar.MyRestartStrategyFactoryFactory".
>>>>>>
>>>>>> The usage of restart strategies you mentioned will keep working with
>>>>>> the new scheduler.
>>>>>>
>>>>>> Thanks,
>>>>>> Zhu Zhu
>>>>>>
>>>>>> On Thu, Sep 12, 2019 at 10:05 PM Oytun Tez <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Zhu,
>>>>>>>
>>>>>>> We are using a custom restart strategy like this:
>>>>>>>
>>>>>>> environment.setRestartStrategy(failureRateRestart(2,
>>>>>>> Time.minutes(1), Time.minutes(10)));
>>>>>>>
>>>>>>>
>>>>>>> ---
>>>>>>> Oytun Tez
>>>>>>>
>>>>>>> *M O T A W O R D*
>>>>>>> The World's Fastest Human Translation Platform.
>>>>>>> [email protected] — www.motaword.com
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Sep 12, 2019 at 7:11 AM Zhu Zhu <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi everyone,
>>>>>>>>
>>>>>>>> I wanted to reach out to you and ask how many of you are using a
>>>>>>>> customized RestartStrategy[1] in production jobs.
>>>>>>>>
>>>>>>>> We are currently developing the new Flink scheduler[2], which
>>>>>>>> interacts with restart strategies in a different way. We have to
>>>>>>>> re-design the interfaces for the new restart strategies (the
>>>>>>>> so-called RestartBackoffTimeStrategy). Existing customized
>>>>>>>> RestartStrategy implementations will no longer work with the new
>>>>>>>> scheduler.
>>>>>>>>
>>>>>>>> We want to know whether we should keep a way to customize
>>>>>>>> RestartBackoffTimeStrategy so that existing customized
>>>>>>>> RestartStrategy implementations can be migrated.
>>>>>>>>
>>>>>>>> I'd appreciate it if you could share your status if you are
>>>>>>>> using a customized RestartStrategy. That will be valuable for us in
>>>>>>>> making decisions.
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> https://ci.apache.org/projects/flink/flink-docs-master/dev/task_failure_recovery.html#restart-strategies
>>>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-10429
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Zhu Zhu
>>>>>>>>
>>>>>>>
