I'll start the vote next Monday if there are no further comments.

Best,
Rui

On Thu, Oct 19, 2023 at 6:59 PM Rui Fan <1996fan...@gmail.com> wrote:

> Hi Konstantin and Max,
>
> Thanks for your feedback!
>
> Sorry, I forgot to mention the default value of
> `restart-strategy.exponential-delay.max-attempts-before-reset-backoff`.
>
> Retrying forever sounds good to me; I have added it to the FLIP:
>
> The default value of
> `restart-strategy.exponential-delay.max-attempts-before-reset-backoff` is
> Integer.MAX_VALUE.
>
> Best,
> Rui
>
> On Thu, Oct 19, 2023 at 6:29 PM Maximilian Michels <m...@apache.org> wrote:
>
>> Hey Rui,
>>
>> +1 for making exponential backoff the default. I agree with Konstantin
>> that retrying forever is a good default for exponential backoff
>> because oftentimes the issue will resolve eventually. The purpose of
>> exponential backoff is precisely to continue to retry without causing
>> too much load. However, I'm not against adding an optional max number
>> of retries.
>>
>> -Max
>>
>> On Thu, Oct 19, 2023 at 11:35 AM Konstantin Knauf <kna...@apache.org>
>> wrote:
>> >
>> > Hi Rui,
>> >
>> > Thank you for this proposal and for working on this. I also agree that
>> > exponential backoff makes sense as a new default in general. I think
>> > restarting indefinitely (no max attempts) makes sense by default,
>> > though, but of course allowing users to change this is valuable.
>> >
>> > So, overall +1.
>> >
>> > Cheers,
>> >
>> > Konstantin
>> >
>> > On Tue, Oct 17, 2023 at 07:11, Rui Fan <1996fan...@gmail.com> wrote:
>> >
>> > > Hi all,
>> > >
>> > > I would like to start a discussion on FLIP-364: Improve the
>> > > restart-strategy[1]
>> > >
>> > > As we know, the restart-strategy is critical for Flink jobs. It
>> > > mainly has two functions:
>> > > 1. When an exception occurs in the Flink job, quickly restart the job
>> > > so that the job can return to the running state.
>> > > 2. When a job cannot be recovered after frequent restarts within
>> > > a certain period of time, Flink will not retry but will fail the job.
>> > >
>> > > The current restart-strategy support for function 2 has some issues:
>> > > 1. The exponential-delay strategy has no max-attempts mechanism,
>> > > which means that Flink will restart indefinitely even if the job
>> > > fails frequently.
>> > > 2. For multi-region streaming jobs and all batch jobs, the failure of
>> > > each region increases the total number of job failures by 1,
>> > > even if these failures occur at the same time. If the number of
>> > > failures increases too quickly, it is difficult to set a reasonable
>> > > number of retries.
>> > > If the maximum number of failures is set too low, the job can easily
>> > > reach the retry limit, causing the job to fail. If it is set too
>> > > high, some jobs will never fail.
>> > >
>> > > In addition, once the above two problems are solved, we can also
>> > > discuss whether exponential-delay can replace fixed-delay as the
>> > > default restart-strategy. In theory, exponential-delay is smarter and
>> > > friendlier than fixed-delay.
>> > >
>> > > I also thank Zhu Zhu in advance for his suggestions on the option
>> > > name in FLINK-32895[2].
>> > >
>> > > I look forward to and welcome everyone's feedback and suggestions.
>> > > Thank you.
>> > >
>> > > [1] https://cwiki.apache.org/confluence/x/uJqzDw
>> > > [2] https://issues.apache.org/jira/browse/FLINK-32895
>> > >
>> > > Best,
>> > > Rui
>> >
>> > --
>> > https://twitter.com/snntrable
>> > https://github.com/knaufk
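
For readers skimming the thread, the behaviour under discussion (exponential backoff plus an optional cap on attempts, with "retry forever" as the default) can be illustrated with a small Python sketch. This is not Flink's implementation and not the FLIP's exact semantics. The class name, field names, the reset rule (a quiet period between failures clears the counter), and the default values other than the `Integer.MAX_VALUE` cap are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Optional

INT_MAX = 2**31 - 1  # mirrors Java's Integer.MAX_VALUE, the default named in the thread


@dataclass
class ExponentialDelaySketch:
    """Hypothetical model of an exponential-delay restart policy with an attempt cap."""

    initial_backoff_s: float = 1.0            # assumed value, for illustration
    max_backoff_s: float = 60.0               # assumed value, for illustration
    multiplier: float = 2.0                   # assumed value, for illustration
    reset_threshold_s: float = 3600.0         # assumed: quiet period that resets the backoff
    max_attempts_before_reset: int = INT_MAX  # default from the thread: retry forever

    def __post_init__(self) -> None:
        self._attempts = 0
        self._backoff_s = self.initial_backoff_s
        self._last_failure_s: Optional[float] = None

    def on_failure(self, now_s: float) -> Optional[float]:
        """Return the delay before the next restart, or None if the job should fail."""
        quiet_long_enough = (
            self._last_failure_s is not None
            and now_s - self._last_failure_s >= self.reset_threshold_s
        )
        if quiet_long_enough:
            # The job ran stably for a while: forget earlier failures.
            self._attempts = 0
            self._backoff_s = self.initial_backoff_s
        self._last_failure_s = now_s

        self._attempts += 1
        if self._attempts > self.max_attempts_before_reset:
            return None  # cap reached without a reset: give up

        delay = self._backoff_s
        # Grow the backoff for the next failure, bounded by the maximum.
        self._backoff_s = min(self._backoff_s * self.multiplier, self.max_backoff_s)
        return delay
```

With the default cap the sketch never returns `None`, which matches the "retry forever" default agreed in the thread. Setting a small `max_attempts_before_reset` makes a job that fails repeatedly within the reset window give up after that many restarts.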