Re: [DISCUSS] FLIP-364: Improve the restart-strategy

Rui Fan Wed, 29 Nov 2023 04:30:45 -0800

Hi all,

The user mailing list thread[1] has been open for 13 days, and it has
collected one useful suggestion.


> Given that the new default feels more complex than the current behavior,
> if we decide to do this I think it will be important to include the
> rationale you've shared in the documentation.

I will add documentation that explains this rationale.

Also, based on Zhu's suggestion, this FLIP has been renamed to
`Improve the exponential-delay restart-strategy`.
It focuses on improving the exponential-delay restart-strategy,
and leaves the fixed-delay and failure-rate strategies out of scope.

If you have no further questions, you are welcome to vote in the
vote thread[2]. Its mail title is still the old one, `[VOTE] FLIP-364:
Improve the restart-strategy`, but it can still be used as the voting thread.

Thank you to everyone who participated in the discussion.

[1] https://lists.apache.org/thread/6glz0d57r8gtpzq4f71vf9066c5x6nyw
[2] https://lists.apache.org/thread/xo03tzw6d02w1vbcj5y9ccpqyc7bqrh9

Best,
Rui

On Fri, Nov 17, 2023 at 12:02 PM Mingliang Liu <lium...@apache.org> wrote:

> Thank you Rui. It makes sense to me now.
>
> On Thu, Nov 16, 2023 at 2:57 AM Rui Fan <1996fan...@gmail.com> wrote:
>
> > Hi all,
> >
> > Zhu and I had an offline discussion today. We prefer that this FLIP
> > focus on improving exponential-delay and use exponential-delay
> > as the default strategy. This means the FLIP doesn't include
> > improvements related to fixed-delay and failure-rate, and the
> > second part of the FLIP (Improve restartAttempt's counting strategy)
> > only improves exponential-delay.
> >
> > The reasons are as follows:
> >
> > 1. Judging from the current discussion, many users want
> >   exponential-delay as the default restart strategy.
> >
> > 2. The naming and the behavior would be inconsistent
> >
> > If we improve the restartAttempt counting mechanism for all
> > restart strategies, we need to unify the concept of restartAttempt
> > counting. We want the count to increase with the number of restarts,
> > not the number of failures. The number of failures grows too fast,
> > so we want to aggregate multiple failures into one restart.
> >
> > However, the failure-rate strategy's upper-limit option is named
> > restart-strategy.failure-rate.max-failures-per-interval:
> > it's max-failures-per-interval, not max-attempts-per-interval.
> > If we changed it directly, the name and the behavior would no longer
> > match.
> >
> > 3. The restartAttempt counting mechanism and the Exception History
> >   do not match
> >
> > If we aggregate multiple failures into one restartAttempt, each failure
> > is still a separate exception in the Exception History. A user who
> > allowed 10 attempts might see 100 failures in the Exception History
> > while the job has still not exited, which may be confusing. This is
> > related to concurrentExceptions and will be followed up in FLINK-33565.
> >
> > For these reasons, we prefer that the current FLIP focus on
> > exponential-delay. After FLINK-33565 is done, we can discuss the
> > rest of the restart strategies again.
> >
> > Looking forward to your feedback, thanks~
> >
> > To Mingliang,
> >
> > Sorry, I missed one of your questions this morning.
> >
> > > One question is the max attempts. Is that the max attempt after which
> > > the job will be deemed failed? I'm wondering if we just simplify the
> > > name from `max-attempts-before-reset-backoff` to `max-attempts` or just
> > > `attempts` (like the static strategy
> > > `restart-strategy.fixed-delay.attempts`). The wording
> > > `before-reset-backoff` makes me think it's setting the backoff interval
> > > to its initial value after this max attempt, instead of failing the job.
> >
> > max-attempts-before-reset-backoff isn't the same as max-attempts or
> > attempts.
> > The exponential-delay strategy has a reset mechanism: when no exception
> > occurs within reset-backoff-threshold, Flink resets the delay time to
> > initial-backoff.
> > max-attempts-before-reset-backoff is the maximum number of restarts
> > we can attempt before that reset happens.
> > - When restartAttempt > max-attempts-before-reset-backoff, the job will
> >   exit.
> > - When no exception occurs within reset-backoff-threshold, the delay
> >   time is reset to initial-backoff, and restartAttempt is reset to 0
> >   as well.
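
The two rules above can be sketched as a small model. This is only an illustration of the semantics described in this mail, not Flink's implementation: the class `RestartModel`, its field names, and the choice to measure the quiet period from the previous failure are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RestartModel:
    """Hypothetical model of exponential-delay with an attempt limit."""
    max_attempts: int                # max-attempts-before-reset-backoff
    initial_backoff: float = 1.0     # seconds
    multiplier: float = 1.2
    max_backoff: float = 60.0        # seconds
    reset_threshold: float = 3600.0  # reset-backoff-threshold, seconds
    attempt: int = 0
    delay: float = 0.0
    last_failure: Optional[float] = None

    def on_failure(self, now: float) -> Optional[float]:
        """Return the delay before the next restart, or None to fail the job."""
        quiet = (
            self.last_failure is not None
            and now - self.last_failure >= self.reset_threshold
        )
        if self.last_failure is None or quiet:
            # No exception within the threshold: reset both delay and counter.
            self.attempt = 0
            self.delay = self.initial_backoff
        else:
            self.delay = min(self.delay * self.multiplier, self.max_backoff)
        self.last_failure = now
        self.attempt += 1
        if self.attempt > self.max_attempts:
            return None  # restartAttempt exceeded the limit: the job exits
        return self.delay


model = RestartModel(max_attempts=2)
first = model.on_failure(now=0.0)
second = model.on_failure(now=10.0)
third = model.on_failure(now=20.0)  # third attempt within the threshold
```

In this model the third failure in quick succession returns None, while a failure that arrives after a quiet period of at least `reset_threshold` starts again from `initial_backoff` with the counter at 1.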
> >
> > After your feedback, I think attempts-before-reset-backoff may be a
> > better name: the `max` can be removed, which makes it similar to
> > `restart-strategy.fixed-delay.attempts`.
> > WDYT?
> >
> > [1] https://issues.apache.org/jira/browse/FLINK-33565
> >
> > Best,
> > Rui
> >
> > On Thu, Nov 16, 2023 at 11:48 AM Rui Fan <1996fan...@gmail.com> wrote:
> >
> >> Hi Zhu and Matthias:
> >>
> >> > 3. failure counting
> >> > Flink currently will try to recognize concurrent failures and group
> >> > them together, which can be seen in the web UI. So how about aligning
> >> > the failure counting with the concurrent failures computing? This can
> >> > make it more consistent and easier to understand. It will require
> >> > changes to the concurrent failures computing though, i.e. taking the
> >> > backoff time into consideration. So maybe we can open a separate FLIP
> >> > for this change.
> >>
> >> I recently analyzed concurrentExceptions in detail and double-checked
> >> with Matthias, who contributed the exception history. We found that
> >> concurrentExceptions doesn't work: it's always empty in production.
> >> I created FLINK-33565[1] to follow up on it.
> >>
> >> To Zhu:
> >>
> >> After discussing with Matthias, we prefer to handle it in a separate
> >> JIRA; FLIP-364 doesn't include it because it's a separate bug. WDYT?
> >>
> >> Thanks to Zhu for mentioning concurrentExceptions, and thanks to
> >> Matthias for helping to double-check.
> >>
> >> [1] https://issues.apache.org/jira/browse/FLINK-33565
> >>
> >> Best,
> >> Rui
> >>
> >> On Thu, Nov 16, 2023 at 11:39 AM Rui Fan <1996fan...@gmail.com> wrote:
> >>
> >>> Hi Zhu, Jing and Mingliang:
> >>>
> >>> Thanks for your feedback about making exponential-delay
> >>> the default restart-strategy and updating the default
> >>> values of exponential-delay as well. I have started a
> >>> discussion on the user, user-zh and dev mailing lists about it[1].
> >>>
> >>> [1] https://lists.apache.org/thread/6glz0d57r8gtpzq4f71vf9066c5x6nyw
> >>>
> >>> Best,
> >>> Rui
> >>>
> >>> On Thu, Nov 16, 2023 at 6:35 AM Mingliang Liu <lium...@apache.org>
> >>> wrote:
> >>>
> >>>> Thanks for sharing your data points.
> >>>>
> >>>> Among a few thousand jobs (from the smallest with 1 task manager to
> >>>> the largest with 300+ task managers), I presume most of them use the
> >>>> default. However, the default values we have been using were not
> >>>> broadly discussed but instead based on a priori knowledge, as we manage
> >>>> many jobs for our (internal) customers. So I believe it's a good idea
> >>>> to engage with the user ML for more feedback. Currently we rely on
> >>>> two explicit configs:
> >>>>
> >>>>> restart-strategy.exponential-delay.initial-backoff: 5 s
> >>>>> restart-strategy.exponential-delay.max-backoff: 2 min
> >>>>
> >>>>
> >>>> I think the default values in the FLIP look good to me overall, though
> >>>> I completely understand that one-size-fits-all default values do not
> >>>> exist. Specifically, a multiplier value between 1 and 2 is more
> >>>> sensible to me than the existing value 2, if we enable exponential
> >>>> backoff as the default. The proposed value 1.2 is in this range.
> >>>> Jitter-factor being 0.1 and reset threshold being 1h are both the same
> >>>> as existing values.
> >>>>
> >>>> One question is the max attempts. Is that the max attempt after which
> >>>> the job will be deemed failed? I'm wondering if we just simplify the
> >>>> name from `max-attempts-before-reset-backoff` to `max-attempts` or just
> >>>> `attempts` (like the static strategy
> >>>> `restart-strategy.fixed-delay.attempts`). The wording
> >>>> `before-reset-backoff` makes me think it's setting the backoff interval
> >>>> to its initial value after this max attempt, instead of failing the job.
> >>>>
> >>>> On Tue, Nov 14, 2023 at 8:16 PM Rui Fan <1996fan...@gmail.com> wrote:
> >>>>
> >>>>> Hi Mingliang:
> >>>>>
> >>>>> Thank you for the feedback here!
> >>>>>
> >>>>> Glad to hear Netflix has made exponential-delay the
> >>>>> default restart strategy. Our production (Shopee) has also used
> >>>>> exponential-delay as the default since May 2021, and the
> >>>>> current number of Flink jobs far exceeds tens of thousands.
> >>>>> These jobs work well.
> >>>>>
> >>>>> Note: Our internal exponential-delay solves the problem
> >>>>> of a large number of tasks failing in a short period of time
> >>>>> causing restartAttempts to increase rapidly.
> >>>>>
> >>>>> Based on your production experience, do you have any suggestions
> >>>>> about the default values of the exponential-delay configuration?
> >>>>>
> >>>>> Zhu and Jing may also be interested in this question.
> >>>>>
> >>>>> The following are the default values proposed in FLIP-364:
> >>>>>
> >>>>> restart-strategy.exponential-delay.max-attempts-before-reset-backoff : Integer.MAX_VALUE
> >>>>> restart-strategy.exponential-delay.initial-backoff : 1s
> >>>>> restart-strategy.exponential-delay.backoff-multiplier : 1.2
> >>>>> restart-strategy.exponential-delay.jitter-factor : 0.1
> >>>>> restart-strategy.exponential-delay.max-backoff : 1 min
> >>>>> restart-strategy.exponential-delay.reset-backoff-threshold : 1h
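
To see what these values imply, here is a minimal sketch that computes the delay before each restart, assuming failures keep arriving before the reset threshold. It is not Flink's implementation: the function name `backoff_schedule` is made up for the example, and the jitter model (scaling each delay by a uniform factor in [1 - jitter, 1 + jitter]) is an assumption. Jitter is disabled by default so the output is deterministic.

```python
import random
from typing import List, Optional


def backoff_schedule(
    restarts: int,
    initial_backoff: float = 1.0,  # initial-backoff : 1s
    multiplier: float = 1.2,       # backoff-multiplier : 1.2
    max_backoff: float = 60.0,     # max-backoff : 1 min
    jitter_factor: float = 0.0,    # the proposal uses 0.1; 0 disables jitter
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return the delay in seconds before each of the first `restarts` restarts."""
    rng = rng or random.Random()
    delays = []
    base = initial_backoff
    for _ in range(restarts):
        # Assumed jitter model: uniform factor in [1 - j, 1 + j].
        jitter = 1.0 + rng.uniform(-jitter_factor, jitter_factor)
        delays.append(base * jitter)
        # Grow the base delay geometrically, capped at max_backoff.
        base = min(base * multiplier, max_backoff)
    return delays


schedule = backoff_schedule(30)
# Index of the first restart whose delay has reached the 1 min cap.
first_capped = next(i for i, d in enumerate(schedule) if d >= 60.0)
print(first_capped)
```

With a multiplier of 1.2 the delay grows slowly: starting from 1s, it takes more than twenty consecutive restarts to reach the 1 min cap, whereas a multiplier of 2 would reach it after six.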
> >>>>>
> >>>>> Looking forward to your feedback! I will also start a discussion
> >>>>> on the user mailing list to collect more feedback.
> >>>>>
> >>>>> In addition, I understand that the community needs to consider
> >>>>> a lot of compatibility and risks when modifying the default value.
> >>>>> If this is very difficult to reach consensus on, I can remove
> >>>>> this item from FLIP.
> >>>>>
> >>>>> Best,
> >>>>> Rui
> >>>>>
> >>>>> On Wed, Nov 15, 2023 at 6:40 AM Mingliang Liu <lium...@apache.org>
> >>>>> wrote:
> >>>>>
> >>>>>> Thanks Rui for driving this. I just want to call out that making
> >>>>>> exponential-delay the default is a good change. At Netflix, we
> >>>>>> enabled this as the default restart strategy 2 quarters ago and it
> >>>>>> has been working well.
> >>>>>> Keeping it restarting indefinitely by default makes sense to me.
> >>>>>>
> >>>>>> On Mon, Oct 16, 2023 at 10:11 PM Rui Fan <1996fan...@gmail.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>> > Hi all,
> >>>>>> >
> >>>>>> > I would like to start a discussion on FLIP-364: Improve the
> >>>>>> > restart-strategy[1]
> >>>>>> >
> >>>>>> > As we know, the restart-strategy is critical for Flink jobs. It
> >>>>>> > mainly has two functions:
> >>>>>> > 1. When an exception occurs in the Flink job, quickly restart the
> >>>>>> > job so that the job can return to the running state.
> >>>>>> > 2. When a job cannot be recovered after frequent restarts within
> >>>>>> > a certain period of time, Flink will not retry but will fail the
> >>>>>> > job.
> >>>>>> >
> >>>>>> > The current restart-strategy support for function 2 has some
> >>>>>> > issues:
> >>>>>> > 1. The exponential-delay strategy doesn't have a max attempts
> >>>>>> > mechanism, which means that Flink will restart indefinitely even
> >>>>>> > if the job fails frequently.
> >>>>>> > 2. For multi-region streaming jobs and all batch jobs, the failure
> >>>>>> > of each region will increase the total number of job failures by
> >>>>>> > 1, even if these failures occur at the same time. If the number of
> >>>>>> > failures increases too quickly, it will be difficult to set a
> >>>>>> > reasonable number of retries.
> >>>>>> > If the maximum number of failures is set too low, the job can
> >>>>>> > easily reach the retry limit, causing the job to fail. If set too
> >>>>>> > high, some jobs will never fail.
> >>>>>> >
> >>>>>> > In addition, when the above two problems are solved, we can also
> >>>>>> > discuss whether exponential-delay can replace fixed-delay as the
> >>>>>> > default restart-strategy. In theory, exponential-delay is smarter
> >>>>>> > and friendlier than fixed-delay.
> >>>>>> >
> >>>>>> > I also thank Zhu Zhu for his suggestions on the option name in
> >>>>>> > FLINK-32895[2] in advance.
> >>>>>> >
> >>>>>> > Looking forward to everyone's feedback and suggestions,
> >>>>>> > thank you.
> >>>>>> >
> >>>>>> > [1] https://cwiki.apache.org/confluence/x/uJqzDw
> >>>>>> > [2] https://issues.apache.org/jira/browse/FLINK-32895
> >>>>>> >
> >>>>>> > Best,
> >>>>>> > Rui
> >>>>>> >
> >>>>>>
> >>>>>
>
