Re: [DISCUSS] FLIP-364: Improve the restart-strategy

Rui Fan Thu, 19 Oct 2023 04:01:01 -0700

Hi Konstantin and Max,

Thanks for your feedback!


Sorry, I forgot to mention the default value of
`restart-strategy.exponential-delay.max-attempts-before-reset-backoff`.

Retrying forever sounds good to me. I have added it to the FLIP:

The default value of
`restart-strategy.exponential-delay.max-attempts-before-reset-backoff` is
Integer.MAX_VALUE.
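
To illustrate what this default means in practice, here is a minimal sketch of exponential backoff with an attempt limit. It is not Flink's implementation: the class name, method names, and parameter names are hypothetical, and the backoff values are arbitrary. Only the idea of a limit that defaults to Integer.MAX_VALUE comes from the FLIP.

```python
# Illustrative sketch only; all names here are hypothetical, not Flink APIs.
JAVA_INTEGER_MAX_VALUE = 2**31 - 1  # value of Java's Integer.MAX_VALUE


class ExponentialDelaySketch:
    """Tracks restart attempts and the delay before each restart."""

    def __init__(self, initial_backoff_ms=1000, max_backoff_ms=60000,
                 multiplier=2,
                 max_attempts_before_reset=JAVA_INTEGER_MAX_VALUE):
        self.initial_backoff_ms = initial_backoff_ms
        self.max_backoff_ms = max_backoff_ms
        self.multiplier = multiplier
        self.max_attempts_before_reset = max_attempts_before_reset
        self.reset()

    def reset(self):
        """Called once the job has been stable long enough (assumption)."""
        self.attempts = 0
        self.next_backoff_ms = self.initial_backoff_ms

    def on_failure(self):
        """Return the restart delay in ms, or None if the job should fail."""
        if self.attempts >= self.max_attempts_before_reset:
            return None  # retry budget exhausted: fail the job
        delay = self.next_backoff_ms
        self.attempts += 1
        # Grow the delay for the next failure, capped at the maximum.
        self.next_backoff_ms = min(delay * self.multiplier,
                                   self.max_backoff_ms)
        return delay


if __name__ == "__main__":
    strategy = ExponentialDelaySketch(max_backoff_ms=5000)
    print([strategy.on_failure() for _ in range(5)])
```

With the default limit, `on_failure()` never returns None in any realistic run, so the job keeps retrying at the capped delay. Setting a small limit makes the job fail once that many attempts happen without a reset.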

Best,
Rui

On Thu, Oct 19, 2023 at 6:29 PM Maximilian Michels <m...@apache.org> wrote:

> Hey Rui,
>
> +1 for making exponential backoff the default. I agree with Konstantin
> that retrying forever is a good default for exponential backoff
> because oftentimes the issue will resolve eventually. The purpose of
> exponential backoff is precisely to continue to retry without causing
> too much load. However, I'm not against adding an optional max number
> of retries.
>
> -Max
>
> On Thu, Oct 19, 2023 at 11:35 AM Konstantin Knauf <kna...@apache.org>
> wrote:
> >
> > Hi Rui,
> >
> > Thank you for this proposal and for working on this. I also agree that
> > exponential backoff makes sense as a new default in general. I think
> > restarting indefinitely (no max attempts) makes sense by default, but
> > allowing users to change this is of course valuable.
> >
> > So, overall +1.
> >
> > Cheers,
> >
> > Konstantin
> >
> > On Tue, Oct 17, 2023 at 07:11, Rui Fan <1996fan...@gmail.com> wrote:
> >
> > > Hi all,
> > >
> > > I would like to start a discussion on FLIP-364: Improve the
> > > restart-strategy[1]
> > >
> > > As we know, the restart-strategy is critical for Flink jobs. It mainly
> > > has two functions:
> > > 1. When an exception occurs in a Flink job, quickly restart the job
> > > so that it can return to the running state.
> > > 2. When a job cannot be recovered after frequent restarts within
> > > a certain period of time, stop retrying and fail the job.
> > >
> > > The current restart-strategy support for function 2 has some issues:
> > > 1. The exponential-delay strategy doesn't have a max attempts mechanism,
> > > which means that Flink will restart indefinitely even if the job fails
> > > frequently.
> > > 2. For multi-region streaming jobs and all batch jobs, the failure of
> > > each region increases the total number of job failures by 1,
> > > even if these failures occur at the same time. If the number of
> > > failures increases too quickly, it is difficult to set a reasonable
> > > number of retries.
> > > If the maximum number of failures is set too low, the job can easily
> > > reach the retry limit, causing the job to fail. If it is set too high,
> > > some jobs will never fail.
> > >
> > > In addition, when the above two problems are solved, we can also
> > > discuss whether exponential-delay can replace fixed-delay as the
> > > default restart-strategy. In theory, exponential-delay is smarter and
> > > friendlier than fixed-delay.
> > >
> > > I would also like to thank Zhu Zhu for his earlier suggestions on the
> > > option name in FLINK-32895[2].
> > >
> > > I look forward to everyone's feedback and suggestions. Thank you.
> > >
> > > [1] https://cwiki.apache.org/confluence/x/uJqzDw
> > > [2] https://issues.apache.org/jira/browse/FLINK-32895
> > >
> > > Best,
> > > Rui
> > >
> >
> >
> > --
> > https://twitter.com/snntrable
> > https://github.com/knaufk
>
