Re: [SURVEY] How many people are using customized RestartStrategy(s)

Steven Wu Wed, 25 Sep 2019 09:14:00 -0700

Zhu Zhu, that is correct.

On Tue, Sep 24, 2019 at 8:04 PM Zhu Zhu <[email protected]> wrote:


> Hi Steven,
>
> To conclude: since we will have a meter metric[1] for restarts, a
> customized restart strategy is not needed in your case.
> Is that right?
>
> [1] https://issues.apache.org/jira/browse/FLINK-14164
>
> Thanks,
> Zhu Zhu
>
> On Wed, Sep 25, 2019 at 2:30 AM Steven Wu <[email protected]> wrote:
>
>> Zhu Zhu,
>>
>> Sorry, I was using different terminology. Yes, the Flink Meter is what I
>> was talking about regarding "fullRestarts" for threshold-based alerting.
>>
>> On Mon, Sep 23, 2019 at 7:46 PM Zhu Zhu <[email protected]> wrote:
>>
>>> Steven,
>>>
>>> As I understand it, a Flink Counter only stores its accumulated count and
>>> reports that value. Are you using an external counter directly?
>>> Maybe the Flink Meter/MeterView is what you need? It stores the count and
>>> calculates the rate, and it reports both its "count" and its "rate" to
>>> external metric services.
>>>
>>> The counter "task_failures" only works if the individual failover
>>> strategy is enabled. However, it is not a public interface and its use
>>> is not recommended, as fine-grained recovery (region failover) now
>>> supersedes it.
>>> I've opened a ticket[1] to add a metric that shows failovers and respects
>>> fine-grained recovery.
>>>
>>> [1] https://issues.apache.org/jira/browse/FLINK-14164
>>>
>>> Thanks,
>>> Zhu Zhu
>>>
>>> On Tue, Sep 24, 2019 at 6:41 AM Steven Wu <[email protected]> wrote:
>>>
>>>>
>>>> When we set up an alert like "fullRestarts > 1" for some rolling window,
>>>> we want to use a counter. If it is a Gauge, "fullRestarts" will never go
>>>> below 1 after the first full restart, so the alert condition will always
>>>> be true after the first job restart. If we can apply a derivative to the
>>>> Gauge value, I guess the alert can probably work. I can explore whether
>>>> that is an option.
>>>>
>>>> Yeah. Understood that "fullRestart" won't increment when fine-grained
>>>> recovery happens. I think a "task_failures" counter already exists in
>>>> Flink.
>>>>
>>>>
>>>>
>>>> On Sun, Sep 22, 2019 at 7:59 PM Zhu Zhu <[email protected]> wrote:
>>>>
>>>>> Steven,
>>>>>
>>>>> Thanks for the information. If we can determine that this is a common
>>>>> issue, we can solve it in Flink core.
>>>>> To get to that state, I have two questions which need your help:
>>>>> 1. Why is a Gauge not good for alerting? The metric "fullRestart" is a
>>>>> Gauge<Long>. Does the metric reporter you use report Counter and
>>>>> Gauge<Long> to external services in different ways? Or can anything
>>>>> else differ due to the metric type?
>>>>> 2. Is the "number of restarts" what you actually need, rather than
>>>>> the "fullRestart" count? If so, I believe we will have such a counter
>>>>> metric in 1.10, since the previous "fullRestart" metric value is not the
>>>>> number of restarts when fine-grained recovery (added in 1.9.0) is
>>>>> enabled.
>>>>>     "fullRestart" reveals how many times the entire job graph has been
>>>>> restarted. If fine-grained recovery is enabled, the graph
>>>>> would not be restarted when task failures happen and the "fullRestart"
>>>>> value will not increment in such cases.
>>>>>
>>>>> I'd appreciate it if you could help with these questions so that we can
>>>>> make better decisions for Flink.
>>>>>
>>>>> Thanks,
>>>>> Zhu Zhu
>>>>>
>>>>> On Sun, Sep 22, 2019 at 3:31 AM Steven Wu <[email protected]> wrote:
>>>>>
>>>>>> Zhu Zhu,
>>>>>>
>>>>>> The Flink fullRestart metric is a Gauge, which is not good for
>>>>>> alerting on. We publish an equivalent Counter metric for alerting
>>>>>> purposes.
>>>>>>
>>>>>> Thanks,
>>>>>> Steven
>>>>>>
>>>>>> On Thu, Sep 19, 2019 at 7:45 PM Zhu Zhu <[email protected]> wrote:
>>>>>>
>>>>>>> Thanks Steven for the feedback!
>>>>>>> Could you share more information about the metrics you add in your
>>>>>>> customized restart strategy?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Zhu Zhu
>>>>>>>
>>>>>>> On Fri, Sep 20, 2019 at 7:11 AM Steven Wu <[email protected]> wrote:
>>>>>>>
>>>>>>>> We do use a config like "restart-strategy:
>>>>>>>> org.foobar.MyRestartStrategyFactoryFactory", mainly to add metrics
>>>>>>>> beyond the ones Flink provides.
>>>>>>>>
>>>>>>>> On Thu, Sep 19, 2019 at 4:50 AM Zhu Zhu <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Thanks everyone for the input.
>>>>>>>>>
>>>>>>>>> The RestartStrategy customization is not recognized as a public
>>>>>>>>> interface, as it is not explicitly documented.
>>>>>>>>> Since the feedback from this survey shows it is not being used, I'll
>>>>>>>>> conclude that we do not need to support customized RestartStrategy
>>>>>>>>> for the new scheduler in Flink 1.10.
>>>>>>>>>
>>>>>>>>> Other usages are still supported, including all the strategies and
>>>>>>>>> configuration methods described in
>>>>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/task_failure_recovery.html#restart-strategies
>>>>>>>>> .
>>>>>>>>>
>>>>>>>>> Feel free to share in this thread if you have any concerns about it.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Zhu Zhu
>>>>>>>>>
>>>>>>>>> On Thu, Sep 12, 2019 at 10:33 PM Zhu Zhu <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks Oytun for the reply!
>>>>>>>>>>
>>>>>>>>>> Sorry for not having stated it clearly. By "customized
>>>>>>>>>> RestartStrategy", we mean that users implement an
>>>>>>>>>> *org.apache.flink.runtime.executiongraph.restart.RestartStrategy*
>>>>>>>>>> themselves and use it via a configuration like "restart-strategy:
>>>>>>>>>> org.foobar.MyRestartStrategyFactoryFactory".
>>>>>>>>>>
>>>>>>>>>> The usage of restart strategies you mentioned will keep working
>>>>>>>>>> with the new scheduler.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Zhu Zhu
>>>>>>>>>>
>>>>>>>>>> On Thu, Sep 12, 2019 at 10:05 PM Oytun Tez <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Zhu,
>>>>>>>>>>>
>>>>>>>>>>> We are using a custom restart strategy like this:
>>>>>>>>>>>
>>>>>>>>>>> environment.setRestartStrategy(failureRateRestart(2,
>>>>>>>>>>> Time.minutes(1), Time.minutes(10)));
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> ---
>>>>>>>>>>> Oytun Tez
>>>>>>>>>>>
>>>>>>>>>>> *M O T A W O R D*
>>>>>>>>>>> The World's Fastest Human Translation Platform.
>>>>>>>>>>> [email protected] — www.motaword.com
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Sep 12, 2019 at 7:11 AM Zhu Zhu <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>
>>>>>>>>>>>> I wanted to reach out to you and ask how many of you are using
>>>>>>>>>>>> a customized RestartStrategy[1] in production jobs.
>>>>>>>>>>>>
>>>>>>>>>>>> We are currently developing the new Flink scheduler[2], which
>>>>>>>>>>>> interacts with restart strategies in a different way. We have to
>>>>>>>>>>>> redesign the interfaces for the new restart strategies (the
>>>>>>>>>>>> so-called RestartBackoffTimeStrategy). Existing customized
>>>>>>>>>>>> RestartStrategy implementations will no longer work with the new
>>>>>>>>>>>> scheduler.
>>>>>>>>>>>>
>>>>>>>>>>>> We want to know whether we should keep a way to customize
>>>>>>>>>>>> RestartBackoffTimeStrategy so that existing customized
>>>>>>>>>>>> RestartStrategy implementations can be migrated.
>>>>>>>>>>>>
>>>>>>>>>>>> I'd appreciate it if you could share your status if you are
>>>>>>>>>>>> using a customized RestartStrategy. That will be valuable for us
>>>>>>>>>>>> in making decisions.
>>>>>>>>>>>>
>>>>>>>>>>>> [1]
>>>>>>>>>>>> https://ci.apache.org/projects/flink/flink-docs-master/dev/task_failure_recovery.html#restart-strategies
>>>>>>>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-10429
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Zhu Zhu
>>>>>>>>>>>>
>>>>>>>>>>>

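A note on the Gauge-versus-Counter point discussed in this thread: a cumulative value such as "fullRestarts" can still drive a rolling-window alert if the alert looks at the increase over the window rather than at the raw value. The following is a minimal Java sketch of that idea, not Flink code. The class name RollingIncrease, its record method, and the millisecond timestamps are hypothetical names chosen for illustration, and the sketch assumes samples arrive with non-decreasing timestamps.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical helper (not part of Flink): converts samples of a cumulative,
 * never-decreasing value into the increase observed over a rolling window.
 */
class RollingIncrease {
    private final long windowMillis;
    // Each entry is {timestampMillis, cumulativeValue}, oldest first.
    private final Deque<long[]> samples = new ArrayDeque<>();
    // Value of the latest sample at or before the start of the window.
    private long baseline;

    RollingIncrease(long windowMillis) {
        if (windowMillis <= 0) {
            throw new IllegalArgumentException("windowMillis must be positive");
        }
        this.windowMillis = windowMillis;
    }

    /** Records a sample and returns the increase within the current window. */
    long record(long timestampMillis, long cumulativeValue) {
        if (samples.isEmpty()) {
            // The first sample is its own baseline, so the increase starts at 0.
            baseline = cumulativeValue;
        }
        samples.addLast(new long[] {timestampMillis, cumulativeValue});

        long windowStart = timestampMillis - windowMillis;
        // Evict samples that fell out of the window, remembering the last
        // evicted value as the new baseline. The newest sample is always
        // inside the window, so the deque never becomes empty here.
        while (samples.peekFirst()[0] <= windowStart) {
            baseline = samples.removeFirst()[1];
        }
        return cumulativeValue - baseline;
    }
}
```

With a 100 ms window, the samples (0, 0), (60, 1), (120, 1), (200, 1) yield increases of 0, 1, 1, 0, so a condition such as "increase > 0" clears once the restart leaves the window, which a raw cumulative value never does.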