Hi Konstantin and Max,

Thanks for your feedback!
Sorry, I forgot to mention the default value of
`restart-strategy.exponential-delay.max-attempts-before-reset-backoff`.

Retrying forever sounds good to me, so I have added it to the FLIP:

The default value of
`restart-strategy.exponential-delay.max-attempts-before-reset-backoff`
is Integer.MAX_VALUE.

Best,
Rui

On Thu, Oct 19, 2023 at 6:29 PM Maximilian Michels <[email protected]> wrote:

> Hey Rui,
>
> +1 for making exponential backoff the default. I agree with Konstantin
> that retrying forever is a good default for exponential backoff
> because oftentimes the issue will resolve eventually. The purpose of
> exponential backoff is precisely to continue to retry without causing
> too much load. However, I'm not against adding an optional max number
> of retries.
>
> -Max
>
> On Thu, Oct 19, 2023 at 11:35 AM Konstantin Knauf <[email protected]> wrote:
> >
> > Hi Rui,
> >
> > Thank you for this proposal and working on this. I also agree that
> > exponential back off makes sense as a new default in general. I think
> > restarting indefinitely (no max attempts) makes sense by default, though,
> > but of course allowing users to change is valuable.
> >
> > So, overall +1.
> >
> > Cheers,
> >
> > Konstantin
> >
> > On Tue, Oct 17, 2023 at 7:11 AM Rui Fan <[email protected]> wrote:
> >
> > > Hi all,
> > >
> > > I would like to start a discussion on FLIP-364: Improve the
> > > restart-strategy[1]
> > >
> > > As we know, the restart-strategy is critical for flink jobs, it mainly
> > > has two functions:
> > > 1. When an exception occurs in the flink job, quickly restart the job
> > > so that the job can return to the running state.
> > > 2. When a job cannot be recovered after frequent restarts within
> > > a certain period of time, Flink will not retry but will fail the job.
> > >
> > > The current restart-strategy support for function 2 has some issues:
> > > 1. The exponential-delay doesn't have the max attempts mechanism,
> > > it means that flink will restart indefinitely even if it fails
> > > frequently.
> > > 2. For multi-region streaming jobs and all batch jobs, the failure of
> > > each region will increase the total number of job failures by +1,
> > > even if these failures occur at the same time. If the number of
> > > failures increases too quickly, it will be difficult to set a
> > > reasonable number of retries.
> > > If the maximum number of failures is set too low, the job can easily
> > > reach the retry limit, causing the job to fail. If set too high, some
> > > jobs will never fail.
> > >
> > > In addition, when the above two problems are solved, we can also
> > > discuss whether exponential-delay can replace fixed-delay as the
> > > default restart-strategy. In theory, exponential-delay is smarter and
> > > friendlier than fixed-delay.
> > >
> > > I also thank Zhu Zhu for his suggestions on the option name in
> > > FLINK-32895[2] in advance.
> > >
> > > Looking forward to and welcome everyone's feedback and suggestions,
> > > thank you.
> > >
> > > [1] https://cwiki.apache.org/confluence/x/uJqzDw
> > > [2] https://issues.apache.org/jira/browse/FLINK-32895
> > >
> > > Best,
> > > Rui
> > >
> >
> > --
> > https://twitter.com/snntrable
> > https://github.com/knaufk
>
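
For readers of the archive, the behaviour discussed in this thread (exponential backoff that retries effectively forever unless a maximum number of attempts is configured) can be illustrated with a minimal Python sketch. This is not Flink's implementation: the function name `backoff_delays`, its parameter names, and the values 1 s, 2.0 and 60 s are assumptions chosen for illustration only, and jitter and backoff reset are left out. `INT_MAX` mirrors Java's `Integer.MAX_VALUE`, the default proposed above.

```python
from itertools import islice

# Mirrors Java's Integer.MAX_VALUE, the proposed default for the max-attempts option.
INT_MAX = 2**31 - 1


def backoff_delays(initial_s=1.0, multiplier=2.0, max_backoff_s=60.0,
                   max_attempts=INT_MAX):
    """Yield the delay (in seconds) before each restart attempt.

    Hypothetical helper: all names and default values here are illustrative
    assumptions, not Flink configuration defaults.
    """
    delay = initial_s
    attempt = 0
    while attempt < max_attempts:
        # Never wait longer than the configured cap.
        yield min(delay, max_backoff_s)
        delay = min(delay * multiplier, max_backoff_s)
        attempt += 1
    # Generator exhausted: a caller would stop restarting and fail the job here.


# With the default limit the sequence is effectively endless, so only a prefix is shown.
print(list(islice(backoff_delays(), 8)))

# With an explicit limit the job would be failed after three restart attempts.
print(list(backoff_delays(max_attempts=3)))
```

The delays grow geometrically until they reach the cap and then stay there, which is why retrying indefinitely adds little load, while a small `max_attempts` turns the same strategy into one that eventually fails the job.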
