1s looks good to me.
I also think the conclusion about when a user should override the delay is
worth documenting.
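
To make the tradeoff from the quoted discussion concrete, here is a minimal sketch in Python. It is not Flink code. The function name `restart_attempts`, the parameter `restart_cost_s` (time one failed attempt takes), and the 60 s outage are assumptions chosen only for illustration. The model assumes every attempt during the outage fails and is followed by the configured delay.

```python
def restart_attempts(outage_s: float, delay_s: float, restart_cost_s: float = 0.5) -> int:
    """Count restart attempts that hit an external system during an outage.

    Hypothetical model: each attempt takes restart_cost_s to fail, then the
    job waits delay_s before trying again. Real jobs will behave differently.
    """
    if outage_s < 0 or delay_s < 0 or restart_cost_s <= 0:
        raise ValueError("outage_s and delay_s must be >= 0, restart_cost_s must be > 0")
    attempts = 0
    elapsed = 0.0
    # Every attempt that starts before the outage ends reconnects to the
    # overloaded system and fails.
    while elapsed < outage_s:
        attempts += 1
        elapsed += restart_cost_s + delay_s
    return attempts


if __name__ == "__main__":
    # Compare the three delay values discussed in the thread.
    for delay in (0, 1, 10):
        print(f"delay={delay:>2}s -> {restart_attempts(60, delay)} attempts in a 60s outage")
```

Under these assumed numbers, a 0s delay produces 120 reconnection attempts during the outage, 1s produces 40, and 10s produces 6, while each successful recovery pays at most the configured delay on top of the restart itself.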

Thanks,
Zhu Zhu

On Tue, Sep 3, 2019 at 4:42 AM, Steven Wu <stevenz...@gmail.com> wrote:

> 1s sounds like a good tradeoff to me.
>
> On Mon, Sep 2, 2019 at 1:30 PM Till Rohrmann <trohrm...@apache.org> wrote:
>
>> Thanks a lot for all your feedback. So far I see a slight tendency
>> towards having a non-zero default delay.
>>
>> However, Yu has brought up some valid points. Maybe I can shed some light
>> on a).
>>
>> Before FLINK-9158 we set the default delay to 10s because Flink did not
>> support queued scheduling which meant that if one slot was missing/still
>> being occupied, then Flink would fail right away with
>> a NoResourceAvailableException. In order to prevent this we added the
>> delay. This also covered the case when the job was failing because of an
>> overloaded external system.
>>
>> When we finished FLIP-6, we thought that we could improve the user
>> experience by decreasing the default delay to 0s because all Flink related
>> problems (slot still occupied, slot missing because of reconnecting TM)
>> could be handled by the default slot request time out which allowed the
>> slots to become ready after the scheduling was kicked off. However, we did
>> not properly take the case of overloaded external systems into account.
>>
>> For b) I agree that any default value should be properly documented. This
>> was clearly an oversight when FLINK-9158 was merged. Moreover, I
>> believe that there won't be a solve-it-all default value. There are
>> always cases where one needs to adapt it to one's needs. But this is ok. The
>> goal should be to find the default value which works for most cases.
>>
>> So maybe the middle ground between 10s and 0s could be a solution.
>> Setting the default restart delay to 1s should prevent restart storms
>> caused by overloaded external systems and still be fast enough to not slow
>> down recoveries noticeably in most cases. If one needs a super fast
>> recovery, then one should set the delay value to 0s. If one requires a
>> longer delay because of a particular infrastructure, then one needs to
>> change the value too. What do you think?
>>
>> Cheers,
>> Till
>>
>> On Sun, Sep 1, 2019 at 11:56 PM Yu Li <car...@gmail.com> wrote:
>>
>>> -1 on increasing the default delay to non-zero, for the reasons below:
>>>
>>> a) I could see some concerns about setting the delay to zero in the very
>>> original JIRA (FLINK-2993
>>> <https://issues.apache.org/jira/browse/FLINK-2993>) but later on in
>>> FLINK-9158 <https://issues.apache.org/jira/browse/FLINK-9158> we still
>>> decided to make the change, so I'm wondering whether the decision also came
>>> from a customer requirement? If so, how could we judge whether one
>>> requirement overrides the other?
>>>
>>> b) There could be valid reasons for both default values depending on
>>> the use case, and workarounds exist (for example, under the current
>>> policy, manually setting the config to 10s resolves the problem
>>> mentioned). From earlier replies in this thread we can see that users have
>>> already taken such action. Changing the default back to non-zero won't
>>> affect those users but might surprise those depending on 0 as the default.
>>>
>>> Last but not least, no matter what decision we make this time, I'd
>>> suggest making it final and documenting it explicitly in our release notes.
>>> Checking the 1.5.0 release notes [1] [2], it seems we didn't mention
>>> the change to the default restart delay, and we'd better learn from that
>>> this time. Thanks.
>>>
>>> [1]
>>> https://flink.apache.org/news/2018/05/25/release-1.5.0.html#release-notes
>>> [2]
>>> https://ci.apache.org/projects/flink/flink-docs-release-1.5/release-notes/flink-1.5.html
>>>
>>> Best Regards,
>>> Yu
>>>
>>>
>>> On Sun, 1 Sep 2019 at 04:33, Steven Wu <stevenz...@gmail.com> wrote:
>>>
>>>> +1 on what Zhu Zhu said.
>>>>
>>>> We also override the default to 10 s.
>>>>
>>>> On Fri, Aug 30, 2019 at 8:58 PM Zhu Zhu <reed...@gmail.com> wrote:
>>>>
>>>>> In our production, we usually override the restart delay to be 10 s.
>>>>> We once encountered cases where external services were overwhelmed by
>>>>> reconnections from frequently restarted tasks.
>>>>> As a safer, though not optimal, option, a default delay larger than 0
>>>>> s is better in my opinion.
>>>>>
>>>>>
>>>>> On Fri, Aug 30, 2019 at 10:23 PM, 未来阳光 <2217232...@qq.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>>
>>>>>> I think it's better to increase the default value. +1
>>>>>>
>>>>>>
>>>>>> Best.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> ------------------ Original Message ------------------
>>>>>> From: "Till Rohrmann"<trohrm...@apache.org>;
>>>>>> Sent: Friday, August 30, 2019, 10:07 PM
>>>>>> To: "dev"<d...@flink.apache.org>; "user"<user@flink.apache.org>;
>>>>>> Subject: [SURVEY] Is the default restart delay of 0s causing problems?
>>>>>>
>>>>>>
>>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> I wanted to reach out to you and ask whether decreasing the default
>>>>>> delay to `0 s` for the fixed delay restart strategy [1] is causing
>>>>>> trouble. A user reported that he would like to increase the default
>>>>>> value because it can cause restart storms in case of systematic
>>>>>> faults [2].
>>>>>>
>>>>>> The downside of increasing the default delay would be a slightly
>>>>>> increased restart time if this config option is not explicitly set.
>>>>>>
>>>>>> [1] https://issues.apache.org/jira/browse/FLINK-9158
>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-11218
>>>>>>
>>>>>> Cheers,
>>>>>> Till
>>>>>
>>>>>
