Steven,

As I understand it, a Flink Counter only stores its accumulated count and
reports that value. Are you using an external counter directly?
Maybe Flink's Meter/MeterView is what you need. It stores the count and
calculates the rate, and it reports both its "count" and its "rate" to
external metric services.
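
To make the difference concrete, here is a minimal plain-Java sketch of
deriving an increase over a rolling window from a cumulative value, which is
roughly what a windowed alert needs. It is not Flink's MeterView
implementation; the class name RollingIncrease and the sample-count window are
assumptions for illustration only.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical helper, not part of Flink: keeps the last N readings of a
 * cumulative metric and reports how much it grew across that window.
 * It does not handle a cumulative value being reset to zero.
 */
public class RollingIncrease {
    private final int windowSize;                        // number of most recent samples kept
    private final Deque<Long> samples = new ArrayDeque<>();

    public RollingIncrease(int windowSize) {
        if (windowSize < 2) {
            throw new IllegalArgumentException("windowSize must be >= 2");
        }
        this.windowSize = windowSize;
    }

    /** Record the latest cumulative reading, e.g. a reported restart count. */
    public void record(long cumulativeValue) {
        samples.addLast(cumulativeValue);
        if (samples.size() > windowSize) {
            samples.removeFirst();                       // drop the reading that left the window
        }
    }

    /** Newest minus oldest reading in the window; 0 with fewer than two readings. */
    public long increase() {
        if (samples.size() < 2) {
            return 0;
        }
        return samples.peekLast() - samples.peekFirst();
    }

    public static void main(String[] args) {
        RollingIncrease restarts = new RollingIncrease(3);
        long[] readings = {0, 1, 1, 1};                  // one restart, then quiet
        for (long reading : readings) {
            restarts.record(reading);
            System.out.println("cumulative=" + reading + " increase=" + restarts.increase());
        }
    }
}
```

With a window of 3 samples and the readings 0, 1, 1, 1, the increase is 1
right after the restart and drops back to 0 once the restart leaves the
window, whereas the cumulative value stays at 1.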

The counter "task_failures" only works if the individual failover strategy
is enabled. However, that strategy is not a public interface and its use is
not recommended, since fine-grained recovery (region failover) now supersedes it.
I've opened a ticket[1] to add a metric that shows failovers and respects
fine-grained recovery.

[1] https://issues.apache.org/jira/browse/FLINK-14164

Thanks,
Zhu Zhu

On Tue, Sep 24, 2019 at 6:41 AM Steven Wu <stevenz...@gmail.com> wrote:

>
> When we set up an alert like "fullRestarts > 1" over some rolling window, we
> want to use a counter. If it is a Gauge, "fullRestarts" will never go below 1
> after the first full restart, so the alert condition will always be true after
> the first job restart. If we can apply a derivative to the Gauge value, I guess
> the alert can probably work. I can explore whether that is an option.
>
> Yeah, understood that "fullRestart" won't increment when fine-grained
> recovery happens. I think the "task_failures" counter already exists in Flink.
>
>
>
> On Sun, Sep 22, 2019 at 7:59 PM Zhu Zhu <reed...@gmail.com> wrote:
>
>> Steven,
>>
>> Thanks for the information. If we can determine that this is a common issue,
>> we can solve it in Flink core.
>> To get to that state, I have two questions which need your help:
>> 1. Why is a gauge not good for alerting? The metric "fullRestart" is a
>> Gauge<Long>. Does the metric reporter you use report Counter and
>> Gauge<Long> to external services in different ways? Or could anything else
>> differ because of the metric type?
>> 2. Is the "number of restarts" what you actually need, rather than
>> the "fullRestart" count? If so, I believe we will have such a counter
>> metric in 1.10, since the existing "fullRestart" metric value is not the
>> number of restarts when fine-grained recovery (a feature added in 1.9.0) is
>> enabled.
>>     "fullRestart" reveals how many times the entire job graph has been
>> restarted. If fine-grained recovery is enabled, the graph
>> would not be restarted when task failures happen, and the "fullRestart"
>> value will not increment in such cases.
>>
>> I'd appreciate it if you could help with these questions so that we can make
>> better decisions for Flink.
>>
>> Thanks,
>> Zhu Zhu
>>
>> On Sun, Sep 22, 2019 at 3:31 AM Steven Wu <stevenz...@gmail.com> wrote:
>>
>>> Zhu Zhu,
>>>
>>> Flink's fullRestart metric is a Gauge, which is not good for alerting.
>>> We publish an equivalent Counter metric for alerting purposes.
>>>
>>> Thanks,
>>> Steven
>>>
>>> On Thu, Sep 19, 2019 at 7:45 PM Zhu Zhu <reed...@gmail.com> wrote:
>>>
>>>> Thanks Steven for the feedback!
>>>> Could you share more information about the metrics you added in your
>>>> customized restart strategy?
>>>>
>>>> Thanks,
>>>> Zhu Zhu
>>>>
>>>> On Fri, Sep 20, 2019 at 7:11 AM Steven Wu <stevenz...@gmail.com> wrote:
>>>>
>>>>> We do use config like "restart-strategy:
>>>>> org.foobar.MyRestartStrategyFactoryFactory", mainly to add metrics
>>>>> beyond the ones Flink provides.
>>>>>
>>>>> On Thu, Sep 19, 2019 at 4:50 AM Zhu Zhu <reed...@gmail.com> wrote:
>>>>>
>>>>>> Thanks everyone for the input.
>>>>>>
>>>>>> The RestartStrategy customization is not recognized as a public
>>>>>> interface, as it is not explicitly documented.
>>>>>> Since the feedback from this survey indicates that it is not used, I'll
>>>>>> conclude that we do not need to support customized RestartStrategy for
>>>>>> the new scheduler in Flink 1.10.
>>>>>>
>>>>>> Other usages are still supported, including all the strategies and
>>>>>> the ways of configuring them described in
>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/task_failure_recovery.html#restart-strategies
>>>>>> .
>>>>>>
>>>>>> Feel free to share in this thread if you have any concerns about it.
>>>>>>
>>>>>> Thanks,
>>>>>> Zhu Zhu
>>>>>>
>>>>>> On Thu, Sep 12, 2019 at 10:33 PM Zhu Zhu <reed...@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks Oytun for the reply!
>>>>>>>
>>>>>>> Sorry for not having stated it clearly. By "customized
>>>>>>> RestartStrategy", we mean that users implement an
>>>>>>> *org.apache.flink.runtime.executiongraph.restart.RestartStrategy*
>>>>>>> themselves and use it via configuration like "restart-strategy:
>>>>>>> org.foobar.MyRestartStrategyFactoryFactory".
>>>>>>>
>>>>>>> The usage of restart strategies you mentioned will keep working with
>>>>>>> the new scheduler.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Zhu Zhu
>>>>>>>
>>>>>>> On Thu, Sep 12, 2019 at 10:05 PM Oytun Tez <oy...@motaword.com> wrote:
>>>>>>>
>>>>>>>> Hi Zhu,
>>>>>>>>
>>>>>>>> We are using a custom restart strategy like this:
>>>>>>>>
>>>>>>>> environment.setRestartStrategy(failureRateRestart(2,
>>>>>>>> Time.minutes(1), Time.minutes(10)));
>>>>>>>>
>>>>>>>>
>>>>>>>> ---
>>>>>>>> Oytun Tez
>>>>>>>>
>>>>>>>> *M O T A W O R D*
>>>>>>>> The World's Fastest Human Translation Platform.
>>>>>>>> oy...@motaword.com — www.motaword.com
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Sep 12, 2019 at 7:11 AM Zhu Zhu <reed...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi everyone,
>>>>>>>>>
>>>>>>>>> I wanted to reach out to you and ask how many of you are using a
>>>>>>>>> customized RestartStrategy[1] in production jobs.
>>>>>>>>>
>>>>>>>>> We are currently developing the new Flink scheduler[2], which
>>>>>>>>> interacts with restart strategies in a different way. We have to
>>>>>>>>> redesign the interfaces for the new restart strategies (the so-called
>>>>>>>>> RestartBackoffTimeStrategy). Existing customized RestartStrategy
>>>>>>>>> implementations will no longer work with the new scheduler.
>>>>>>>>>
>>>>>>>>> We want to know whether we should keep a way
>>>>>>>>> to customize RestartBackoffTimeStrategy so that existing customized
>>>>>>>>> RestartStrategy implementations can be migrated.
>>>>>>>>>
>>>>>>>>> I'd appreciate it if you could share your status if you are
>>>>>>>>> using a customized RestartStrategy. That will be valuable for us to
>>>>>>>>> make decisions.
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>> https://ci.apache.org/projects/flink/flink-docs-master/dev/task_failure_recovery.html#restart-strategies
>>>>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-10429
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Zhu Zhu
>>>>>>>>>
>>>>>>>>
