Re: [DISCUSS] FLIP-364: Improve the restart-strategy

Rui Fan Fri, 10 Nov 2023 02:22:22 -0800

I'll start voting next Monday if there isn't any other comment.

Best,
Rui


On Thu, Oct 19, 2023 at 6:59 PM Rui Fan <[email protected]> wrote:

> Hi Konstantin and Max,
>
> Thanks for your feedback!
>
> Sorry, I forgot to mention the default value of
> `restart-strategy.exponential-delay.max-attempts-before-reset-backoff`.
>
> Retrying forever sounds good to me. I have added it to the FLIP:
>
> The default value of
> `restart-strategy.exponential-delay.max-attempts-before-reset-backoff` is
> Integer.MAX_VALUE.
>
> Best,
> Rui
>
> On Thu, Oct 19, 2023 at 6:29 PM Maximilian Michels <[email protected]> wrote:
>
>> Hey Rui,
>>
>> +1 for making exponential backoff the default. I agree with Konstantin
>> that retrying forever is a good default for exponential backoff
>> because oftentimes the issue will resolve eventually. The purpose of
>> exponential backoff is precisely to continue to retry without causing
>> too much load. However, I'm not against adding an optional max number
>> of retries.
>>
>> -Max
>>
>> On Thu, Oct 19, 2023 at 11:35 AM Konstantin Knauf <[email protected]>
>> wrote:
>> >
>> > Hi Rui,
>> >
>> > Thank you for this proposal and for working on this. I also agree that
>> > exponential backoff makes sense as a new default in general. I think
>> > restarting indefinitely (no max attempts) makes sense by default, though,
>> > but of course allowing users to change it is valuable.
>> >
>> > So, overall +1.
>> >
>> > Cheers,
>> >
>> > Konstantin
>> >
>> > On Tue, Oct 17, 2023 at 07:11, Rui Fan <[email protected]> wrote:
>> >
>> > > Hi all,
>> > >
>> > > I would like to start a discussion on FLIP-364: Improve the
>> > > restart-strategy[1]
>> > >
>> > > As we know, the restart-strategy is critical for Flink jobs. It mainly
>> > > has two functions:
>> > > 1. When an exception occurs in a Flink job, quickly restart the job
>> > > so that it can return to the running state.
>> > > 2. When a job cannot be recovered after frequent restarts within
>> > > a certain period of time, Flink stops retrying and fails the job.
>> > >
>> > > The current restart-strategy support for function 2 has some issues:
>> > > 1. The exponential-delay strategy has no max-attempts mechanism,
>> > > which means that Flink will restart indefinitely even if the job
>> > > fails frequently.
>> > > 2. For multi-region streaming jobs and all batch jobs, the failure of
>> > > each region increases the total number of job failures by 1,
>> > > even if these failures occur at the same time. If the number of
>> > > failures increases too quickly, it is difficult to set a reasonable
>> > > number of retries.
>> > > If the maximum number of failures is set too low, the job can easily
>> > > reach the retry limit, causing the job to fail. If it is set too high,
>> > > some jobs will never fail.
>> > >
>> > > In addition, when the above two problems are solved, we can also
>> > > discuss whether exponential-delay can replace fixed-delay as the
>> > > default restart-strategy. In theory, exponential-delay is smarter and
>> > > friendlier than fixed-delay.
>> > >
>> > > I would also like to thank Zhu Zhu for his suggestions on the option
>> > > name in FLINK-32895[2].
>> > >
>> > > I look forward to everyone's feedback and suggestions. Thank you.
>> > >
>> > > [1] https://cwiki.apache.org/confluence/x/uJqzDw
>> > > [2] https://issues.apache.org/jira/browse/FLINK-32895
>> > >
>> > > Best,
>> > > Rui
>> > >
>> >
>> >
>> > --
>> > https://twitter.com/snntrable
>> > https://github.com/knaufk
>>
>
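For readers following the thread, here is a rough sketch of the behavior under discussion: an exponentially growing restart delay, capped at a maximum, combined with an optional limit on attempts before the backoff is reset. It is only an illustration and not Flink's implementation. The class `ExponentialDelaySketch` and its methods are hypothetical names, and the reading of `max-attempts-before-reset-backoff` (stop restarting once that many attempts happen without a reset) is an assumption based on the description above. Passing `Integer.MAX_VALUE` as the limit corresponds to the "retry forever" default agreed on in the thread.

```java
import java.time.Duration;

/** Illustrative sketch only; hypothetical names, not Flink's API. */
public class ExponentialDelaySketch {
    private final Duration initialBackoff;
    private final Duration maxBackoff;
    private final double backoffMultiplier;
    // Assumed meaning: restarts allowed before the backoff has been reset.
    private final int maxAttemptsBeforeResetBackoff;
    private int attempts = 0;

    public ExponentialDelaySketch(
            Duration initialBackoff,
            Duration maxBackoff,
            double backoffMultiplier,
            int maxAttemptsBeforeResetBackoff) {
        this.initialBackoff = initialBackoff;
        this.maxBackoff = maxBackoff;
        this.backoffMultiplier = backoffMultiplier;
        this.maxAttemptsBeforeResetBackoff = maxAttemptsBeforeResetBackoff;
    }

    /** Returns true while the attempt limit has not been reached. */
    public boolean canRestart() {
        return attempts < maxAttemptsBeforeResetBackoff;
    }

    /** Records one failure and returns the delay before the next restart. */
    public Duration nextBackoff() {
        if (!canRestart()) {
            throw new IllegalStateException("restart attempts exhausted");
        }
        // delay = initialBackoff * multiplier^attempts, capped at maxBackoff
        double millis = initialBackoff.toMillis() * Math.pow(backoffMultiplier, attempts);
        attempts++;
        long capped = (long) Math.min(millis, (double) maxBackoff.toMillis());
        return Duration.ofMillis(capped);
    }

    /** Called once the job has run stably for long enough. */
    public void resetBackoff() {
        attempts = 0;
    }

    public static void main(String[] args) {
        ExponentialDelaySketch strategy = new ExponentialDelaySketch(
                Duration.ofSeconds(1), Duration.ofSeconds(5), 2.0, 4);
        while (strategy.canRestart()) {
            System.out.println("restart after " + strategy.nextBackoff().toMillis() + " ms");
        }
        System.out.println("attempt limit reached, the job would fail");
    }
}
```

With a limit of `Integer.MAX_VALUE`, `canRestart()` stays true in practice, so the delay simply grows until it reaches the cap and the job keeps retrying at that interval.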
