Re: [SURVEY] Is the default restart delay of 0s causing problems?

Till Rohrmann Tue, 03 Sep 2019 02:42:58 -0700

The FLIP-62 discuss thread can be found here [1].

[1]
https://lists.apache.org/thread.html/9602b342602a0181fcb618581f3b12e692ed2fad98c59fd6c1caeabd@%3Cdev.flink.apache.org%3E


Cheers,
Till

On Tue, Sep 3, 2019 at 11:13 AM Till Rohrmann <trohrm...@apache.org> wrote:

> Thanks everyone for the input again. I'll then conclude this survey thread
> and start a discuss thread to set the default restart delay to 1s.
>
> @Arvid, I agree that better documentation on how to tune Flink with sane
> settings for certain scenarios would be super helpful. However, as you've
> said, it is somewhat hijacking the discussion, so I would exclude it from my
> proposed changes. The best thing to do would be to start a separate
> discussion/effort for it.
>
> Concerning the restart strategy configuration options, they are currently
> only documented here [1]. I'm about to change it with this PR [2].
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-master/dev/task_failure_recovery.html
> [2] https://github.com/apache/flink/pull/9562
>
> Cheers,
> Till
>
> On Tue, Sep 3, 2019 at 8:21 AM Arvid Heise <ar...@data-artisans.com>
> wrote:
>
>> Hi all,
>>
>> I just wanted to share my experience with configuration. For non-expert
>> users, configuring Flink can be very daunting. The list of common
>> properties already helps a lot [1], but it's not clear how the options
>> depend on each other, and settings common to specific use cases are not
>> listed.
>>
>> If we can give reasonably clear starting recommendations for the most
>> common use cases (batch small/large cluster, streaming high throughput/low
>> latency), I think users would be able to start much more quickly with a
>> reasonably well-configured system and fine-tune the settings later. For
>> example, Kafka Streams has a section on how to set the parameters for
>> maximum resilience [2].
>>
>> I'd propose to leave the current configuration page as a reference page,
>> but also have a recommended configuration settings page that's directly
>> linked in the first section, such that new users are not overwhelmed.
>>
>> Sorry if this response is hijacking the discussion.
>> Btw, is the restart-strategy configuration missing from the main
>> configuration page? Is this a conscious decision?
>>
>> [1]
>> https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#common-options
>> [2]
>> https://docs.confluent.io/current/streams/developer-guide/config-streams.html#recommended-configuration-parameters-for-resiliency
>>
>> On Tue, Sep 3, 2019 at 5:10 AM Zhu Zhu <reed...@gmail.com> wrote:
>>
>>> 1s looks good to me.
>>> And I think the conclusion about when a user should override the delay is
>>> worth documenting.
>>>
>>> Thanks,
>>> Zhu Zhu
>>>
>>> Steven Wu <stevenz...@gmail.com> wrote on Tue, Sep 3, 2019 at 4:42 AM:
>>>
>>>> 1s sounds like a good tradeoff to me.
>>>>
>>>> On Mon, Sep 2, 2019 at 1:30 PM Till Rohrmann <trohrm...@apache.org>
>>>> wrote:
>>>>
>>>>> Thanks a lot for all your feedback. I see there is a slight tendency
>>>>> towards having a non-zero default delay so far.
>>>>>
>>>>> However, Yu has brought up some valid points. Maybe I can shed some
>>>>> light on a).
>>>>>
>>>>> Before FLINK-9158 we set the default delay to 10s because Flink did
>>>>> not support queued scheduling which meant that if one slot was
>>>>> missing/still being occupied, then Flink would fail right away with
>>>>> a NoResourceAvailableException. In order to prevent this we added the
>>>>> delay. This also covered the case when the job was failing because of an
>>>>> overloaded external system.
>>>>>
>>>>> When we finished FLIP-6, we thought that we could improve the user
>>>>> experience by decreasing the default delay to 0s because all Flink-related
>>>>> problems (slot still occupied, slot missing because of reconnecting TM)
>>>>> could be handled by the default slot request timeout, which allowed the
>>>>> slots to become ready after the scheduling was kicked off. However, we did
>>>>> not properly take the case of overloaded external systems into account.
>>>>>
>>>>> For b) I agree that any default value should be properly documented.
>>>>> This was clearly an oversight when FLINK-9158 was merged. Moreover, I
>>>>> believe that there won't be a solve-it-all default value. There are
>>>>> always cases where one needs to adapt it to one's needs. But this is ok.
>>>>> The goal should be to find the default value which works for most cases.
>>>>>
>>>>> So maybe the middle ground between 10s and 0s could be a solution.
>>>>> Setting the default restart delay to 1s should prevent restart storms
>>>>> caused by overloaded external systems and still be fast enough to not slow
>>>>> down recoveries noticeably in most cases. If one needs a super fast
>>>>> recovery, then one should set the delay value to 0s. If one requires a
>>>>> longer delay because of a particular infrastructure, then one needs to
>>>>> change the value too. What do you think?
>>>>>
>>>>> Cheers,
>>>>> Till
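
To make the trade-off described above concrete, here is a rough back-of-the-envelope sketch of how the restart delay bounds the reconnection load on an external system while a job is stuck in a fail-restart loop. The function name, the task count and the 0.5 s restart overhead are illustrative assumptions, not Flink measurements or Flink APIs.

```python
def reconnects_per_minute(num_tasks, restart_overhead_s, restart_delay_s):
    """Approximate connection attempts per minute against an external system
    while a job keeps failing and restarting.

    All inputs are hypothetical: num_tasks is the number of tasks that each
    reconnect on restart, restart_overhead_s is the time one restart takes
    even with no configured delay, restart_delay_s is the configured delay.
    """
    cycle_s = restart_overhead_s + restart_delay_s
    if cycle_s <= 0:
        raise ValueError("a restart cycle must take a positive amount of time")
    restarts_per_minute = 60.0 / cycle_s
    return restarts_per_minute * num_tasks


# Compare the three delays discussed in this thread (0s, 1s, 10s).
for delay in (0.0, 1.0, 10.0):
    rate = reconnects_per_minute(
        num_tasks=100, restart_overhead_s=0.5, restart_delay_s=delay
    )
    print(f"delay={delay}s -> roughly {rate:.0f} reconnects per minute")
```

Under these assumed numbers, moving from 0s to 1s cuts the reconnection rate to a third, while adding only one second to each recovery.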
>>>>>
>>>>> On Sun, Sep 1, 2019 at 11:56 PM Yu Li <car...@gmail.com> wrote:
>>>>>
>>>>>> -1 on increasing the default delay to non-zero, for the reasons below:
>>>>>>
>>>>>> a) I could see some concerns about setting the delay to zero in the
>>>>>> very original JIRA (FLINK-2993
>>>>>> <https://issues.apache.org/jira/browse/FLINK-2993>), but later on in
>>>>>> FLINK-9158 <https://issues.apache.org/jira/browse/FLINK-9158> we
>>>>>> still decided to make the change, so I'm wondering whether that decision
>>>>>> also came from a customer requirement. If so, how could we judge
>>>>>> whether one requirement overrides the other?
>>>>>>
>>>>>> b) There could be valid reasons for both default values depending on
>>>>>> the use case, as well as corresponding workarounds (for example, under
>>>>>> the latest policy, manually setting the config to 10s could resolve the
>>>>>> problem mentioned), and from former replies to this thread we can see
>>>>>> that users have already taken action. Changing it back to non-zero
>>>>>> again won't affect such users but might surprise those depending on 0
>>>>>> as the default.
>>>>>>
>>>>>> Last but not least, no matter what decision we make this time, I'd
>>>>>> suggest making it final and documenting it explicitly in our release
>>>>>> notes. Checking the 1.5.0 release notes [1] [2], it seems we didn't
>>>>>> mention the change to the default restart delay, and we'd better learn
>>>>>> from that this time. Thanks.
>>>>>>
>>>>>> [1]
>>>>>> https://flink.apache.org/news/2018/05/25/release-1.5.0.html#release-notes
>>>>>> [2]
>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.5/release-notes/flink-1.5.html
>>>>>>
>>>>>> Best Regards,
>>>>>> Yu
>>>>>>
>>>>>>
>>>>>> On Sun, 1 Sep 2019 at 04:33, Steven Wu <stevenz...@gmail.com> wrote:
>>>>>>
>>>>>>> +1 on what Zhu Zhu said.
>>>>>>>
>>>>>>> We also override the default to 10 s.
>>>>>>>
>>>>>>> On Fri, Aug 30, 2019 at 8:58 PM Zhu Zhu <reed...@gmail.com> wrote:
>>>>>>>
>>>>>>>> In our production, we usually override the restart delay to be 10 s.
>>>>>>>> We once encountered cases where external services were overwhelmed by
>>>>>>>> reconnections from frequently restarted tasks.
>>>>>>>> As a safer though not optimal option, a default delay larger than
>>>>>>>> 0 s is better in my opinion.
>>>>>>>>
>>>>>>>>
>>>>>>>> 未来阳光 <2217232...@qq.com> wrote on Fri, Aug 30, 2019 at 10:23 PM:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I think it's better to increase the default value. +1
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Best.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ------------------ Original Message ------------------
>>>>>>>>> From: "Till Rohrmann"<trohrm...@apache.org>;
>>>>>>>>> Sent: Friday, Aug 30, 2019, 10:07 PM
>>>>>>>>> To: "dev"<d...@flink.apache.org>; "user"<user@flink.apache.org>;
>>>>>>>>> Subject: [SURVEY] Is the default restart delay of 0s causing problems?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hi everyone,
>>>>>>>>>
>>>>>>>>> I wanted to reach out to you and ask whether decreasing the default
>>>>>>>>> delay to `0 s` for the fixed delay restart strategy [1] is causing
>>>>>>>>> trouble. A user reported that he would like to increase the default
>>>>>>>>> value because it can cause restart storms in case of systematic
>>>>>>>>> faults [2].
>>>>>>>>>
>>>>>>>>> The downside of increasing the default delay would be a slightly
>>>>>>>>> increased restart time if this config option is not explicitly set.
>>>>>>>>>
>>>>>>>>> [1] https://issues.apache.org/jira/browse/FLINK-9158
>>>>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-11218
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Till
>>>>>>>>
>>>>>>>>
>>
>> --
>>
>> Arvid Heise | Senior Software Engineer
>>
>> <https://www.ververica.com/>
>
