Re: [DISCUSS] FLIP-364: Improve the restart-strategy

Mingliang Liu Wed, 15 Nov 2023 14:35:42 -0800

Thanks for sharing your data points.

Among our few thousand jobs (ranging from 1 task manager at the smallest to
300+ task managers at the largest), I presume most use the default. However,
the default values we have been using were not broadly discussed; they are
based on a priori knowledge from managing many jobs for our (internal)
customers. So I believe it's a good idea to engage the user ML for more
feedback. Currently we rely on two explicit configs:


> restart-strategy.exponential-delay.initial-backoff: 5 s
> restart-strategy.exponential-delay.max-backoff: 2 min


The default values in the FLIP look good to me overall, though I
completely understand that one-size-fits-all default values do not
exist. Specifically, a multiplier between 1 and 2 is more sensible to
me than the existing value of 2 if we enable exponential backoff as the
default, and the proposed value of 1.2 is in this range. The jitter-factor
of 0.1 and the reset threshold of 1h are both the same as the existing values.
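
To make the multiplier difference concrete, here is a rough sketch of how
quickly each setting reaches the cap. It assumes the delay for the k-th
consecutive restart is min(initial * multiplier^k, max) and ignores jitter;
the function names are made up for illustration and this is not Flink's
implementation.

```python
def restarts_to_cap(initial_s, max_s, multiplier):
    """Count consecutive restarts until the delay first reaches max_s.

    Assumes delay grows as initial_s * multiplier**k (jitter ignored).
    """
    delay, restarts = initial_s, 0
    while delay < max_s:
        delay *= multiplier
        restarts += 1
    return restarts


def schedule(initial_s, max_s, multiplier, count):
    """First `count` delays in seconds, capped at max_s (jitter ignored)."""
    return [min(initial_s * multiplier ** k, max_s) for k in range(count)]


# FLIP-364 proposal: 1 s initial, 1 min cap, multiplier 1.2
print(restarts_to_cap(1, 60, 1.2))
# Same bounds with the existing multiplier of 2
print(restarts_to_cap(1, 60, 2))
# Our current config: 5 s initial, 2 min cap, multiplier 2
print(restarts_to_cap(5, 120, 2))
```

Under these assumptions a multiplier of 1.2 takes 23 consecutive restarts to
reach the 1 min cap, while a multiplier of 2 reaches it after 6, so the smaller
value keeps the first handful of restarts much quicker.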

One question is about the max attempts. Is that the number of attempts after
which the job will be deemed failed? I'm wondering if we could simplify the
name from `max-attempts-before-reset-backoff` to `max-attempts` or just
`attempts` (like the static strategy `restart-strategy.fixed-delay.attempts`).
The wording `before-reset-backoff` makes me think it resets the backoff
interval to its initial value after this max attempt, instead of failing
the job.
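
To spell out the two readings I have in mind, here is a small sketch. Both
functions are hypothetical and only illustrate the ambiguity in the name;
neither is claimed to be what the FLIP specifies.

```python
def delay_if_limit_fails_job(attempt, max_attempts, initial_s, multiplier, max_s):
    """Reading A: past max_attempts the job fails (signalled here by None)."""
    if attempt > max_attempts:
        return None
    return min(initial_s * multiplier ** (attempt - 1), max_s)


def delay_if_limit_resets_backoff(attempt, max_attempts, initial_s, multiplier, max_s):
    """Reading B: past max_attempts the backoff starts over; the job never fails."""
    step = (attempt - 1) % max_attempts
    return min(initial_s * multiplier ** step, max_s)


# Attempts are 1-based; limit of 3, 1 s initial, multiplier 2, 60 s cap.
for attempt in range(1, 5):
    print(attempt,
          delay_if_limit_fails_job(attempt, 3, 1, 2, 60),
          delay_if_limit_resets_backoff(attempt, 3, 1, 2, 60))
```

The two readings agree for the first three attempts and diverge on the fourth:
reading A gives up, reading B goes back to the 1 s delay.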

On Tue, Nov 14, 2023 at 8:16 PM Rui Fan <[email protected]> wrote:

> Hi Mingliang:
>
> Thank you for the feedback here!
>
> Glad to hear Netflix has made exponential-delay the
> default restart strategy. Our production (Shopee) has also used
> exponential-delay as the default since May 2021, and the
> current number of Flink jobs far exceeds tens of thousands.
> These jobs work well.
>
> Note: Our internal exponential-delay solves the problem
> of a large number of tasks failing in a short period of time
> causing restartAttempts to increase rapidly.
>
> Based on your production, do you have any suggestions
> about default values of exponential-delay configuration?
>
> Zhu and Jing may also be interested in this question.
>
> Following are FLIP-364 proposed default values:
>
> restart-strategy.exponential-delay.max-attempts-before-reset-backoff :
> Integer.MAX_VALUE
> restart-strategy.exponential-delay.initial-backoff : 1s
> restart-strategy.exponential-delay.backoff-multiplier : 1.2
> restart-strategy.exponential-delay.jitter-factor : 0.1
> restart-strategy.exponential-delay.max-backoff : 1 min
> restart-strategy.exponential-delay.reset-backoff-threshold : 1h
>
> Looking forward to your feedback! I will also start a discussion
> on the user mailing list to collect more feedback.
>
> In addition, I understand that the community needs to consider
> a lot of compatibility and risks when modifying the default value.
> If this is very difficult to reach consensus on, I can remove
> this item from FLIP.
>
> Best,
> Rui
>
> On Wed, Nov 15, 2023 at 6:40 AM Mingliang Liu <[email protected]> wrote:
>
>> Thanks Rui for driving this. I just want to call out that making
>> exponential-delay the default is a good change. At Netflix, we enabled
>> this as the default restart strategy 2 quarters ago and it has been
>> working well. Keeping it restarting indefinitely by default makes sense
>> to me.
>>
>> On Mon, Oct 16, 2023 at 10:11 PM Rui Fan <[email protected]> wrote:
>>
>> > Hi all,
>> >
>> > I would like to start a discussion on FLIP-364: Improve the
>> > restart-strategy[1]
>> >
>> > As we know, the restart-strategy is critical for flink jobs, it mainly
>> > has two functions:
>> > 1. When an exception occurs in the flink job, quickly restart the job
>> > so that the job can return to the running state.
>> > 2. When a job cannot be recovered after frequent restarts within
>> > a certain period of time, Flink will not retry but will fail the job.
>> >
>> > The current restart-strategy support for function 2 has some issues:
>> > 1. The exponential-delay doesn't have the max attempts mechanism,
>> > it means that flink will restart indefinitely even if it fails
>> > frequently.
>> > 2. For multi-region streaming jobs and all batch jobs, the failure of
>> > each region will increase the total number of job failures by +1,
>> > even if these failures occur at the same time. If the number of
>> > failures increases too quickly, it will be difficult to set a reasonable
>> > number of retries.
>> > If the maximum number of failures is set too low, the job can easily
>> > reach the retry limit, causing the job to fail. If set too high, some
>> > jobs will never fail.
>> >
>> > In addition, when the above two problems are solved, we can also
>> > discuss whether exponential-delay can replace fixed-delay as the
>> > default restart-strategy. In theory, exponential-delay is smarter and
>> > friendlier than fixed-delay.
>> >
>> > I also thank Zhu Zhu for his suggestions on the option name in
>> > FLINK-32895[2] in advance.
>> >
>> > Looking forward to and welcome everyone's feedback and suggestions,
>> > thank you.
>> >
>> > [1] https://cwiki.apache.org/confluence/x/uJqzDw
>> > [2] https://issues.apache.org/jira/browse/FLINK-32895
>> >
>> > Best,
>> > Rui
>> >
>>
>
