Re: [DISCUSS] FLIP-364: Improve the restart-strategy

Rui Fan Wed, 15 Nov 2023 19:50:55 -0800

Hi Zhu and Matthias:

> 3. failure counting
> Flink currently will try to recognize concurrent failures and group them
> together, which can be seen in the web UI. So how about to align the
> failure counting with the concurrent failures computing? This can make it
> more consistent and easier for understanding. It will require changes to
> the concurrent failures computing though, i.e. taking the backoff time
> into consideration. So maybe we can open a seperate FLIP for this change.


I recently analyzed concurrentExceptions in detail, and after
double-checking
with Matthias who is the contributor of exception history. We found
the concurrentExceptions doesn't work, it's always empty in production.
I created FLINK-33565[1] to follow it.

To Zhu:

Discussed with Matthias, we prefer it as a separate JIRA, and
FLIP-364 doesn't include it due to it's a separate bug. WDYT?

Thanks Zhu mentioned the concurrentExceptions, and thanks Matthias
help double check.

[1] https://issues.apache.org/jira/browse/FLINK-33565

Best,
Rui

On Thu, Nov 16, 2023 at 11:39 AM Rui Fan <[email protected]> wrote:

> Hi Zhu, Jing and Mingliang:
>
> Thanks for your feedback about consider exponential-delay
> as the default restart-strategy, and updating the default
> values of exponential-delay as well. I have started a
> discussion on user, user-zh and dev mail list about it[1].
>
> [1] https://lists.apache.org/thread/6glz0d57r8gtpzq4f71vf9066c5x6nyw
>
> Best,
> Rui
>
> On Thu, Nov 16, 2023 at 6:35 AM Mingliang Liu <[email protected]> wrote:
>
>> Thanks for sharing your data points.
>>
>> Among a few thousand jobs (from the smallest 1 task manager and the
>> largest 300+ task managers), I presume most of them use the default.
>> However, the default values we have been using were not broadly discussed
>> but instead based on a priori knowledge as we manage many jobs for our
>> (internal) customers. So I believe it's a good idea to engage with user ML
>> for more feedback. Currently we rely on the two explicit config:
>>
>>> restart-strategy.exponential-delay.initial-backoff: 5 s
>>> restart-strategy.exponential-delay.max-backoff: 2 min
>>
>>
>> I think the default values in the FLIP look good to me overall, though I
>> completely understand that the one-size-fits-all default values do not
>> exist. Specifically, a multiplier value between 1 and 2 is more sensible to
>> me than the existing value 2, if we enable exponential backoff as the
>> default. The proposed value 1.2 is in this range. Jitter-factor being 0.1
>> and reset threshold being 1h are both the same as existing values.
>>
>> One question is the max attempts. Is that the max attempt after which the
>> job will be deemed failed? I'm wondering if we just simplify the name from
>> `max-attempts-before-reset-backoff` to `max-attempts` or just `attempts`
>> (like the static strategy `restart-strategy.fixed-delay.attempts`). The
>> wording `before-reset-backoff ` makes me think it's setting the backoff
>> interval to its initial value after this max attempt, instead of failing
>> the job.
>>
>> On Tue, Nov 14, 2023 at 8:16 PM Rui Fan <[email protected]> wrote:
>>
>>> Hi Mingliang:
>>>
>>> Thanks you for the feedback here!
>>>
>>> Glad to hear Netflix have made exponential-delay as the
>>> default restart strategy. Our production(Shopee) also makes
>>> exponential-delay as the default since May 2021, and the
>>> current number of flink jobs far exceeds tens of thousands.
>>> These jobs work well.
>>>
>>> Note: Our internal exponential-delay solves the problem
>>> of a large number of tasks failing in a short period of time
>>> causing restartAttempts to increase rapidly.
>>>
>>> Based on your production, do you have any suggestions
>>> about default values of exponential-delay configuration?
>>>
>>> Zhu and Jing may also be interested in this question.
>>>
>>> Following are FLIP-364 proposed default values:
>>>
>>> restart-strategy.exponential-delay.max-attempts-before-reset-backoff :
>>> Integer.MAX_VALUE
>>> restart-strategy.exponential-delay.initial-backoff : 1s
>>> restart-strategy.exponential-delay.backoff-multiplier : 1.2
>>> restart-strategy.exponential-delay.jitter-factor : 0.1
>>> restart-strategy.exponential-delay.max-backoff : 1 min
>>> restart-strategy.exponential-delay.reset-backoff-threshold : 1h
>>>
>>> Looking forward to your feedback! And I will start a discussion
>>> on user mail list to collect more feedback.
>>>
>>> In addition, I understand that the community needs to consider
>>> a lot of compatibility and risks when modifying the default value.
>>> If this is very difficult to reach consensus on, I can remove
>>> this item from FLIP.
>>>
>>> Best,
>>> Rui
>>>
>>> On Wed, Nov 15, 2023 at 6:40 AM Mingliang Liu <[email protected]>
>>> wrote:
>>>
>>>> Thanks Rui for driving this. I just call out that making
>>>> exponential-delay
>>>> the default is a good change. At Netflix, we have enabled this as the
>>>> default restart strategy 2 quarters ago and it has been working well.
>>>> Keeping it restarting indefinitely by default makes sense to me.
>>>>
>>>> On Mon, Oct 16, 2023 at 10:11 PM Rui Fan <[email protected]> wrote:
>>>>
>>>> > Hi all,
>>>> >
>>>> > I would like to start a discussion on FLIP-364: Improve the
>>>> > restart-strategy[1]
>>>> >
>>>> > As we know, the restart-strategy is critical for flink jobs, it mainly
>>>> > has two functions:
>>>> > 1. When an exception occurs in the flink job, quickly restart the job
>>>> > so that the job can return to the running state.
>>>> > 2. When a job cannot be recovered after frequent restarts within
>>>> > a certain period of time, Flink will not retry but will fail the job.
>>>> >
>>>> > The current restart-strategy support for function 2 has some issues:
>>>> > 1. The exponential-delay doesn't have the max attempts mechanism,
>>>> > it means that flink will restart indefinitely even if it fails
>>>> frequently.
>>>> > 2. For multi-region streaming jobs and all batch jobs, the failure of
>>>> > each region will increase the total number of job failures by +1,
>>>> > even if these failures occur at the same time. If the number of
>>>> > failures increases too quickly, it will be difficult to set a
>>>> reasonable
>>>> > number of retries.
>>>> > If the maximum number of failures is set too low, the job can easily
>>>> > reach the retry limit, causing the job to fail. If set too high, some
>>>> jobs
>>>> > will never fail.
>>>> >
>>>> > In addition, when the above two problems are solved, we can also
>>>> > discuss whether exponential-delay can replace fixed-delay as the
>>>> > default restart-strategy. In theory, exponential-delay is smarter and
>>>> > friendlier than fixed-delay.
>>>> >
>>>> > I also thank Zhu Zhu for his suggestions on the option name in
>>>> > FLINK-32895[2] in advance.
>>>> >
>>>> > Looking forward to and welcome everyone's feedback and suggestions,
>>>> thank
>>>> > you.
>>>> >
>>>> > [1] https://cwiki.apache.org/confluence/x/uJqzDw
>>>> > [2] https://issues.apache.org/jira/browse/FLINK-32895
>>>> >
>>>> > Best,
>>>> > Rui
>>>> >
>>>>
>>>

Re: [DISCUSS] FLIP-364: Improve the restart-strategy

Reply via email to