We will then keep the decision not to support customized restart
strategies in Flink 1.10.

Thanks, Steven, for the input!

Thanks,
Zhu Zhu

Steven Wu <stevenz...@gmail.com> wrote on Thu, Sep 26, 2019 at 12:13 AM:

> Zhu Zhu, that is correct.
>
> On Tue, Sep 24, 2019 at 8:04 PM Zhu Zhu <reed...@gmail.com> wrote:
>
>> Hi Steven,
>>
>> To conclude: since we will have a meter metric[1] for restarts, a
>> customized restart strategy is not needed in your case.
>> Is that right?
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-14164
>>
>> Thanks,
>> Zhu Zhu
>>
>> Steven Wu <stevenz...@gmail.com> wrote on Wed, Sep 25, 2019 at 2:30 AM:
>>
>>> Zhu Zhu,
>>>
>>> Sorry, I was using different terminology. Yes, the Flink Meter is what I
>>> was talking about regarding "fullRestarts" for threshold-based alerting.
>>>
>>> On Mon, Sep 23, 2019 at 7:46 PM Zhu Zhu <reed...@gmail.com> wrote:
>>>
>>>> Steven,
>>>>
>>>> As I understand it, a Flink Counter only stores its accumulated count and
>>>> reports that value. Are you using an external counter directly?
>>>> Maybe Flink's Meter/MeterView is what you need? It stores the count and
>>>> calculates the rate, and it reports both its "count" and its "rate" to
>>>> external metric services.
>>>>
>>>> The counter "task_failures" only works if the individual failover
>>>> strategy is enabled. However, it is not a public interface and its use
>>>> is not recommended, as fine-grained recovery (region failover) now
>>>> supersedes it.
>>>> I've opened a ticket[1] to add a metric that shows failovers and respects
>>>> fine-grained recovery.
>>>>
>>>> [1] https://issues.apache.org/jira/browse/FLINK-14164
>>>>
>>>> Thanks,
>>>> Zhu Zhu
>>>>
>>>> Steven Wu <stevenz...@gmail.com> wrote on Tue, Sep 24, 2019 at 6:41 AM:
>>>>
>>>>>
>>>>> When we set up an alert like "fullRestarts > 1" over some rolling window,
>>>>> we want to use a counter. If it is a Gauge, "fullRestarts" will never go
>>>>> below 1 after the first full restart, so the alert condition will always
>>>>> be true after the first job restart. If we can apply a derivative to the
>>>>> Gauge value, I guess the alert can probably work. I can explore whether
>>>>> that is an option.
>>>>>
>>>>> Yeah. Understood that "fullRestart" won't increment when fine-grained
>>>>> recovery happens. I think a "task_failures" counter already exists in
>>>>> Flink.
>>>>>
>>>>>
>>>>>
>>>>> On Sun, Sep 22, 2019 at 7:59 PM Zhu Zhu <reed...@gmail.com> wrote:
>>>>>
>>>>>> Steven,
>>>>>>
>>>>>> Thanks for the information. If we can determine that this is a common
>>>>>> issue, we can solve it in Flink core.
>>>>>> To get to that state, I have two questions that need your help:
>>>>>> 1. Why is a gauge not good for alerting? The metric "fullRestart" is a
>>>>>> Gauge<Long>. Does the metric reporter you use report Counter and
>>>>>> Gauge<Long> to external services in different ways? Or can anything else
>>>>>> differ due to the metric type?
>>>>>> 2. Is the "number of restarts" what you actually need, rather than
>>>>>> the "fullRestart" count? If so, I believe we will have such a counter
>>>>>> metric in 1.10, since the previous "fullRestart" metric value is not the
>>>>>> number of restarts when fine-grained recovery (a feature added in 1.9.0)
>>>>>> is enabled.
>>>>>>     "fullRestart" reveals how many times the entire job graph has been
>>>>>> restarted. If fine-grained recovery is enabled, the graph is not
>>>>>> restarted when task failures happen, and the "fullRestart" value does
>>>>>> not increment in such cases.
>>>>>>
>>>>>> I'd appreciate it if you could help with these questions so that we can
>>>>>> make better decisions for Flink.
>>>>>>
>>>>>> Thanks,
>>>>>> Zhu Zhu
>>>>>>
>>>>>> Steven Wu <stevenz...@gmail.com> wrote on Sun, Sep 22, 2019 at 3:31 AM:
>>>>>>
>>>>>>> Zhu Zhu,
>>>>>>>
>>>>>>> Flink's fullRestart metric is a Gauge, which is not well suited to
>>>>>>> alerting. We publish an equivalent Counter metric for alerting purposes.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Steven
>>>>>>>
>>>>>>> On Thu, Sep 19, 2019 at 7:45 PM Zhu Zhu <reed...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thanks, Steven, for the feedback!
>>>>>>>> Could you share more information about the metrics you add in your
>>>>>>>> customized restart strategy?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Zhu Zhu
>>>>>>>>
>>>>>>>> Steven Wu <stevenz...@gmail.com> wrote on Fri, Sep 20, 2019 at 7:11 AM:
>>>>>>>>
>>>>>>>>> We do use a config like "restart-strategy:
>>>>>>>>> org.foobar.MyRestartStrategyFactoryFactory", mainly to add metrics
>>>>>>>>> beyond the ones Flink provides.
>>>>>>>>>
>>>>>>>>> On Thu, Sep 19, 2019 at 4:50 AM Zhu Zhu <reed...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks everyone for the input.
>>>>>>>>>>
>>>>>>>>>> RestartStrategy customization is not recognized as a public
>>>>>>>>>> interface, as it is not explicitly documented.
>>>>>>>>>> As the feedback from this survey shows no usage of it, I'll
>>>>>>>>>> conclude that we do not need to support customized RestartStrategy
>>>>>>>>>> for the new scheduler in Flink 1.10.
>>>>>>>>>>
>>>>>>>>>> Other usages are still supported, including all the strategies
>>>>>>>>>> and configuration methods described in
>>>>>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/task_failure_recovery.html#restart-strategies
>>>>>>>>>> .
>>>>>>>>>>
>>>>>>>>>> Feel free to share in this thread if you have any concerns about it.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Zhu Zhu
>>>>>>>>>>
>>>>>>>>>> Zhu Zhu <reed...@gmail.com> wrote on Thu, Sep 12, 2019 at 10:33 PM:
>>>>>>>>>>
>>>>>>>>>>> Thanks, Oytun, for the reply!
>>>>>>>>>>>
>>>>>>>>>>> Sorry for not having stated it clearly. By "customized
>>>>>>>>>>> RestartStrategy", we mean that users implement an
>>>>>>>>>>> *org.apache.flink.runtime.executiongraph.restart.RestartStrategy*
>>>>>>>>>>> themselves and use it through a configuration like "restart-strategy:
>>>>>>>>>>> org.foobar.MyRestartStrategyFactoryFactory".
>>>>>>>>>>>
>>>>>>>>>>> The usage of restart strategies you mentioned will keep working
>>>>>>>>>>> with the new scheduler.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Zhu Zhu
>>>>>>>>>>>
>>>>>>>>>>> Oytun Tez <oy...@motaword.com> wrote on Thu, Sep 12, 2019 at 10:05 PM:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Zhu,
>>>>>>>>>>>>
>>>>>>>>>>>> We are using a custom restart strategy like this:
>>>>>>>>>>>>
>>>>>>>>>>>> environment.setRestartStrategy(failureRateRestart(2,
>>>>>>>>>>>> Time.minutes(1), Time.minutes(10)));
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> ---
>>>>>>>>>>>> Oytun Tez
>>>>>>>>>>>>
>>>>>>>>>>>> *M O T A W O R D*
>>>>>>>>>>>> The World's Fastest Human Translation Platform.
>>>>>>>>>>>> oy...@motaword.com — www.motaword.com
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Sep 12, 2019 at 7:11 AM Zhu Zhu <reed...@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I wanted to reach out to you and ask how many of you are using
>>>>>>>>>>>>> a customized RestartStrategy[1] in production jobs.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We are currently developing the new Flink scheduler[2], which
>>>>>>>>>>>>> interacts with restart strategies in a different way. We have to
>>>>>>>>>>>>> redesign the interfaces for the new restart strategies (the
>>>>>>>>>>>>> so-called RestartBackoffTimeStrategy). Existing customized
>>>>>>>>>>>>> RestartStrategy implementations will no longer work with the new
>>>>>>>>>>>>> scheduler.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We want to know whether we should keep a way to customize
>>>>>>>>>>>>> RestartBackoffTimeStrategy so that existing customized
>>>>>>>>>>>>> RestartStrategy implementations can be migrated.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'd appreciate it if you could share your status if you are
>>>>>>>>>>>>> using a customized RestartStrategy. That will be valuable for us
>>>>>>>>>>>>> in making decisions.
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1]
>>>>>>>>>>>>> https://ci.apache.org/projects/flink/flink-docs-master/dev/task_failure_recovery.html#restart-strategies
>>>>>>>>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-10429
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Zhu Zhu
>>>>>>>>>>>>>
>>>>>>>>>>>>
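
The alerting problem discussed in this thread (a rule such as
"fullRestarts > 1" on a cumulative gauge stays true forever after the
first restarts, while a windowed increase or rate clears again) can be
illustrated with a minimal sketch. It is plain Java and deliberately
does not use Flink's Gauge, Counter, or Meter classes; the names
RestartAlertSketch, WindowedIncrease, and naiveAlert are hypothetical
and exist only for this illustration, and the one-sample-per-tick
model is an assumption, not how any particular metric reporter works.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical illustration only: plain Java, not Flink's metric API.
final class RestartAlertSketch {

    /** Alert rule applied directly to a cumulative reading, as with a raw gauge value. */
    static boolean naiveAlert(long cumulativeRestarts, long threshold) {
        return cumulativeRestarts > threshold;
    }

    /** Tracks how much a cumulative value grew over the last windowTicks samples. */
    static final class WindowedIncrease {
        private final int windowTicks;
        private final Deque<Long> samples = new ArrayDeque<>();

        WindowedIncrease(int windowTicks) {
            if (windowTicks < 1) {
                throw new IllegalArgumentException("windowTicks must be >= 1");
            }
            this.windowTicks = windowTicks;
        }

        /** Records the latest cumulative reading and returns its increase over the window. */
        long update(long cumulativeRestarts) {
            samples.addLast(cumulativeRestarts);
            // Keep windowTicks + 1 samples so the difference spans windowTicks intervals.
            if (samples.size() > windowTicks + 1) {
                samples.removeFirst();
            }
            return samples.peekLast() - samples.peekFirst();
        }
    }

    public static void main(String[] args) {
        // Assumed sample data: two restarts happen at the third tick, then nothing.
        long[] readings = {0, 0, 2, 2, 2, 2, 2};
        WindowedIncrease window = new WindowedIncrease(3);
        for (long reading : readings) {
            long increase = window.update(reading);
            System.out.println("reading=" + reading
                    + " naiveAlert=" + naiveAlert(reading, 1)
                    + " windowedAlert=" + (increase > 1));
        }
    }
}
```

With the assumed readings, the naive rule keeps firing once the
cumulative value exceeds the threshold, whereas the windowed rule stops
firing once the restarts fall out of the window. This is the same idea
as applying a derivative to the gauge value or alerting on a meter's
rate.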
