The FLIP-62 discuss thread can be found here [1].

[1] https://lists.apache.org/thread.html/9602b342602a0181fcb618581f3b12e692ed2fad98c59fd6c1caeabd@%3Cdev.flink.apache.org%3E

Cheers,
Till

On Tue, Sep 3, 2019 at 11:13 AM Till Rohrmann <trohrm...@apache.org> wrote:

> Thanks everyone for the input again. I'll then conclude this survey thread
> and start a discuss thread to set the default restart delay to 1s.
>
> @Arvid, I agree that better documentation on how to tune Flink with sane
> settings for certain scenarios would be super helpful. However, as you've
> said, it is somewhat hijacking the discussion, so I would exclude it from
> my proposed changes. The best thing to do would be to start a separate
> discussion/effort for it.
>
> Concerning the restart strategy configuration options, they are currently
> only documented here [1]. I'm about to change that with this PR [2].
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-master/dev/task_failure_recovery.html
> [2] https://github.com/apache/flink/pull/9562
>
> Cheers,
> Till
>
> On Tue, Sep 3, 2019 at 8:21 AM Arvid Heise <ar...@data-artisans.com>
> wrote:
>
>> Hi all,
>>
>> I just wanted to share my experience with configurations with you. For
>> non-expert users, configuring Flink can be very daunting. The list of
>> common properties already helps a lot [1], but it's not clear how they
>> depend on each other, and settings common for specific use cases are not
>> listed.
>>
>> If we can give somewhat clear starting recommendations for the most
>> common use cases (batch small/large cluster, streaming high throughput/low
>> latency), I think users would be able to start much more quickly with a
>> somewhat well-configured system and fine-tune the settings later. For
>> example, Kafka Streams has a section on how to set the parameters for
>> maximum resilience [2].
>>
>> I'd propose to leave the current configuration page as a reference page,
>> but also have a recommended configuration settings page that's directly
>> linked in the first section, such that new users are not overwhelmed.
>>
>> Sorry if this response is hijacking the discussion.
>>
>> Btw, is the restart-strategy configuration missing from the main
>> configuration page? Is this a conscious decision?
>>
>> [1]
>> https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#common-options
>> [2]
>> https://docs.confluent.io/current/streams/developer-guide/config-streams.html#recommended-configuration-parameters-for-resiliency
>>
>> On Tue, Sep 3, 2019 at 5:10 AM Zhu Zhu <reed...@gmail.com> wrote:
>>
>>> 1s looks good to me.
>>> And I think the conclusion about when a user should override the delay
>>> is worth documenting.
>>>
>>> Thanks,
>>> Zhu Zhu
>>>
>>> On Tue, Sep 3, 2019 at 4:42 AM Steven Wu <stevenz...@gmail.com> wrote:
>>>
>>>> 1s sounds like a good tradeoff to me.
>>>>
>>>> On Mon, Sep 2, 2019 at 1:30 PM Till Rohrmann <trohrm...@apache.org>
>>>> wrote:
>>>>
>>>>> Thanks a lot for all your feedback. I see there is a slight tendency
>>>>> towards having a non-zero default delay so far.
>>>>>
>>>>> However, Yu has brought up some valid points. Maybe I can shed some
>>>>> light on a).
>>>>>
>>>>> Before FLINK-9158 we set the default delay to 10s because Flink did
>>>>> not support queued scheduling, which meant that if one slot was
>>>>> missing/still being occupied, then Flink would fail right away with
>>>>> a NoResourceAvailableException. In order to prevent this we added the
>>>>> delay. This also covered the case when the job was failing because of
>>>>> an overloaded external system.
>>>>>
>>>>> When we finished FLIP-6, we thought that we could improve the user
>>>>> experience by decreasing the default delay to 0s because all Flink
>>>>> related problems (slot still occupied, slot missing because of
>>>>> reconnecting TM) could be handled by the default slot request timeout,
>>>>> which allowed the slots to become ready after the scheduling was
>>>>> kicked off. However, we did not properly take the case of overloaded
>>>>> external systems into account.
>>>>>
>>>>> For b) I agree that any default value should be properly documented.
>>>>> This was clearly an oversight when FLINK-9158 was merged. Moreover, I
>>>>> believe that there won't be a one-size-fits-all default value. There
>>>>> are always cases where one needs to adapt it to one's needs. But this
>>>>> is ok. The goal should be to find the default value which works for
>>>>> most cases.
>>>>>
>>>>> So maybe the middle ground between 10s and 0s could be a solution.
>>>>> Setting the default restart delay to 1s should prevent restart storms
>>>>> caused by overloaded external systems and still be fast enough to not
>>>>> slow down recoveries noticeably in most cases. If one needs a super
>>>>> fast recovery, then one should set the delay value to 0s. If one
>>>>> requires a longer delay because of a particular infrastructure, then
>>>>> one needs to change the value too. What do you think?
>>>>>
>>>>> Cheers,
>>>>> Till
>>>>>
>>>>> On Sun, Sep 1, 2019 at 11:56 PM Yu Li <car...@gmail.com> wrote:
>>>>>
>>>>>> -1 on increasing the default delay to non-zero, for the reasons below:
>>>>>>
>>>>>> a) I could see some concerns about setting the delay to zero in the
>>>>>> very original JIRA (FLINK-2993
>>>>>> <https://issues.apache.org/jira/browse/FLINK-2993>), but later on in
>>>>>> FLINK-9158 <https://issues.apache.org/jira/browse/FLINK-9158> we
>>>>>> still decided to make the change, so I'm wondering whether the
>>>>>> decision also came from a customer requirement? If so, how could we
>>>>>> judge whether one requirement overrides the other?
>>>>>>
>>>>>> b) There could be valid reasons for both default values depending on
>>>>>> different use cases, as well as respective workarounds (e.g., based
>>>>>> on the latest policy, setting the config manually to 10s could resolve
>>>>>> the problem mentioned), and from former replies to this thread we can
>>>>>> see users have already taken action. Changing it back to non-zero
>>>>>> again won't affect such users but might cause surprises to those
>>>>>> depending on 0 as the default.
>>>>>>
>>>>>> Last but not least, no matter what decision we make this time, I'd
>>>>>> suggest making it final and documenting it explicitly in our release
>>>>>> notes. Checking the 1.5.0 release notes [1] [2], it seems we didn't
>>>>>> mention the change to the default restart delay, and we'd better learn
>>>>>> from it this time. Thanks.
>>>>>>
>>>>>> [1]
>>>>>> https://flink.apache.org/news/2018/05/25/release-1.5.0.html#release-notes
>>>>>> [2]
>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.5/release-notes/flink-1.5.html
>>>>>>
>>>>>> Best Regards,
>>>>>> Yu
>>>>>>
>>>>>>
>>>>>> On Sun, 1 Sep 2019 at 04:33, Steven Wu <stevenz...@gmail.com> wrote:
>>>>>>
>>>>>>> +1 on what Zhu Zhu said.
>>>>>>>
>>>>>>> We also override the default to 10 s.
>>>>>>>
>>>>>>> On Fri, Aug 30, 2019 at 8:58 PM Zhu Zhu <reed...@gmail.com> wrote:
>>>>>>>
>>>>>>>> In our production, we usually override the restart delay to be 10 s.
>>>>>>>> We once encountered cases where external services were overwhelmed
>>>>>>>> by reconnections from frequently restarted tasks.
>>>>>>>> As a safer though not optimal option, a default delay larger than
>>>>>>>> 0 s is better in my opinion.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Aug 30, 2019 at 10:23 PM 未来阳光 <2217232...@qq.com> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I think it's better to increase the default value. +1
>>>>>>>>>
>>>>>>>>> Best.
>>>>>>>>>
>>>>>>>>> ------------------ Original Message ------------------
>>>>>>>>> From: "Till Rohrmann"<trohrm...@apache.org>;
>>>>>>>>> Sent: Friday, Aug 30, 2019, 10:07 PM
>>>>>>>>> To: "dev"<d...@flink.apache.org>; "user"<user@flink.apache.org>;
>>>>>>>>> Subject: [SURVEY] Is the default restart delay of 0s causing problems?
>>>>>>>>>
>>>>>>>>> Hi everyone,
>>>>>>>>>
>>>>>>>>> I wanted to reach out to you and ask whether decreasing the
>>>>>>>>> default delay to `0 s` for the fixed delay restart strategy [1] is
>>>>>>>>> causing trouble.
>>>>>>>>> A user reported that he would like to increase the default value
>>>>>>>>> because it can cause restart storms in case of systematic faults [2].
>>>>>>>>>
>>>>>>>>> The downside of increasing the default delay would be a slightly
>>>>>>>>> increased restart time if this config option is not explicitly set.
>>>>>>>>>
>>>>>>>>> [1] https://issues.apache.org/jira/browse/FLINK-9158
>>>>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-11218
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Till
>>
>> --
>> Arvid Heise | Senior Software Engineer
>> Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
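
As a rough illustration of the trade-off discussed in this thread, the
following Python sketch estimates how often a job that keeps failing for a
systematic reason would restart (and reconnect to an external system) within
one minute, for restart delays of 0 s, 1 s and 10 s. It is not Flink code:
the function name `restart_attempts`, the assumed 100 ms for a job to redeploy
and fail again, and the one-minute window are all hypothetical values chosen
only for illustration.

```python
def restart_attempts(delay_ms: int, redeploy_ms: int, window_ms: int) -> int:
    """Count full restart cycles that fit into a time window.

    One cycle = waiting for the restart delay + redeploying the job until it
    fails again. All durations are integer milliseconds to avoid float
    rounding issues.
    """
    cycle_ms = delay_ms + redeploy_ms
    if cycle_ms <= 0:
        raise ValueError("delay_ms + redeploy_ms must be positive")
    return window_ms // cycle_ms


# Assumptions for illustration only (not measured Flink numbers):
REDEPLOY_MS = 100    # time for the job to come up and fail again
WINDOW_MS = 60_000   # observe one minute

# Restart delays mentioned in the thread: 0 s, 1 s and 10 s.
attempts = {
    delay_ms: restart_attempts(delay_ms, REDEPLOY_MS, WINDOW_MS)
    for delay_ms in (0, 1_000, 10_000)
}

for delay_ms, count in attempts.items():
    print(f"delay={delay_ms / 1000:g}s -> {count} restarts per minute")
```

Under these assumptions, a 1 s delay reduces the number of restarts per
minute from 600 to 54, while adding only about one second to each individual
recovery; a 10 s delay reduces it further to 5 at the cost of a noticeably
slower recovery.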