Re: [DISCUSS] Change the default restart-strategy to exponential-delay

Rui Fan Tue, 19 Dec 2023 01:23:27 -0800

Thanks everyone for the feedback!

No further feedback has come in on this thread, so I have just started
a new vote[1] to update the default value of backoff-multiplier from
1.2 to 1.5.
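
For reference, here is a minimal sketch of the delay schedule that a 1.5
multiplier would produce. It assumes the other values discussed in this
thread (initial-backoff of 1 s, max-backoff of 1 min) and ignores jitter.
The helper name backoff_schedule is made up for illustration; this is
not Flink's implementation.

```python
def backoff_schedule(initial_s, multiplier, max_backoff_s, attempts):
    """Return the delay (in seconds) before each of the first `attempts` restarts.

    Hypothetical helper for illustration only: no jitter, no reset threshold.
    """
    delays = []
    delay = initial_s
    for _ in range(attempts):
        # The delay grows geometrically but never exceeds the configured cap.
        delays.append(min(delay, max_backoff_s))
        delay *= multiplier
    return delays


# Assumed values: initial-backoff 1 s, backoff-multiplier 1.5, max-backoff 1 min.
schedule = backoff_schedule(1.0, 1.5, 60.0, 13)
for attempt, delay in enumerate(schedule, start=1):
    print(f"restart {attempt:2d}: wait {delay:.2f}s")
```

Under these assumptions the delay first hits the 60 s cap at restart 12.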


[1] https://lists.apache.org/thread/0b1dcwb49owpm6v1j8rhrg9h0fvs5nkt

Best,
Rui

On Tue, Dec 12, 2023 at 7:14 PM Maximilian Michels <[email protected]> wrote:

> Thank you Rui! I think a 1.5 multiplier is a reasonable tradeoff
> between restarting fast but not putting too much pressure on the
> cluster due to restarts.
>
> -Max
>
> On Tue, Dec 12, 2023 at 8:19 AM Rui Fan <[email protected]> wrote:
> >
> > Hi Maximilian and Mason,
> >
> > Thanks a lot for your feedback!
> >
> > After an offline consultation with Max, I think I now understand your
> > concern: when a Flink job restarts, it makes a bunch of calls to the
> > Kubernetes API, e.g. reading/writing config maps and creating task
> > managers. Currently, the default restart strategy is fixed-delay with
> > a 1s delay, so Flink restarts jobs at high frequency even when they
> > cannot be started. This can make the Kubernetes cluster unstable.
> >
> > That's why I propose changing the default restart strategy to
> > exponential-delay. With it, restarts still happen quickly unless
> > there are consecutive failures, which helps keep external
> > components stable.
> >
> > After discussing with Max and Zhu Zhu in the PR comment[1],
> > Max suggested using 1.5 as the default value of backoff-multiplier
> > instead of 1.2. A value of 1.2 is a little small (the delay time is
> > too short). This picture[2] shows the relationship between
> > restart-attempts and retry-delay-time when backoff-multiplier is
> > 1.2 and 1.5:
> >
> > - The delay-time will reach 1 min after 12 attempts when backoff-multiplier is 1.5
> > - The delay-time will reach 1 min after 24 attempts when backoff-multiplier is 1.2
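
For anyone who wants to check these two figures without the picture, here
is a small sketch. It assumes an initial-backoff of 1 s, a max-backoff of
1 min, no jitter, and that restart attempt n waits
initial * multiplier^(n-1) seconds. The function name is made up for
illustration.

```python
import math


def attempts_to_reach_cap(multiplier, initial_s=1.0, max_backoff_s=60.0):
    """Return the first restart attempt whose uncapped delay is >= max_backoff_s.

    Assumes attempt n waits initial_s * multiplier ** (n - 1) seconds, without jitter.
    """
    # Solve initial_s * multiplier ** (n - 1) >= max_backoff_s for the smallest integer n.
    exponent = math.log(max_backoff_s / initial_s) / math.log(multiplier)
    return math.ceil(exponent) + 1


for multiplier in (1.2, 1.5):
    print(multiplier, attempts_to_reach_cap(multiplier))
```

Under these assumptions the function returns 24 for a multiplier of 1.2
and 12 for a multiplier of 1.5, which matches the two bullets above.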
> >
> > Is there any other suggestion? Looking forward to more feedback, thanks~
> >
> > BTW, as Zhu said in the comment[1], if we update the default value,
> > a new vote is needed for it. So I will pause FLINK-33736[3] for now,
> > and the rest of the JIRAs of FLIP-364 will continue.
> >
> > To Mason:
> >
> > If I understand your concerns correctly, I still don't know how
> > to benchmark this. The Kubernetes cluster instability only happens
> > when one cluster runs a lot of jobs, so in general a test cannot
> > reproduce the pressure. Could you elaborate on how to
> > benchmark for this?
> >
> > After this FLIP, the default restart frequency will be reduced
> > significantly, especially when a job fails consecutively.
> > Do you think the benchmark is still necessary?
> >
> > Looking forward to your feedback, thanks~
> >
> > [1] https://github.com/apache/flink/pull/23247#discussion_r1422626734
> > [2] https://github.com/apache/flink/assets/38427477/642c57e0-b415-4326-af05-8b506c5fbb3a
> > [3] https://issues.apache.org/jira/browse/FLINK-33736
> >
> > Best,
> > Rui
> >
> > On Thu, Dec 7, 2023 at 10:57 PM Maximilian Michels <[email protected]> wrote:
> >>
> >> Hey Rui,
> >>
> >> +1 for changing the default restart strategy to exponential-delay.
> >> This is something all users eventually run into. They end up changing
> >> the restart strategy to exponential-delay. I think the current
> >> defaults are quite balanced. Restarts happen quickly enough unless
> >> there are consecutive failures where I think it makes sense to double
> >> the waiting time up till the max.
> >>
> >> -Max
> >>
> >>
> >> On Wed, Dec 6, 2023 at 12:51 AM Mason Chen <[email protected]> wrote:
> >> >
> >> > Hi Rui,
> >> >
> >> > Sorry for the late reply. I was suggesting that perhaps we could do
> >> > some testing with Kubernetes wrt configuring values for the
> >> > exponential restart strategy. We've noticed that the default strategy
> >> > in 1.17 caused a lot of requests to the K8s API server for unstable
> >> > deployments.
> >> >
> >> > However, people in different Kubernetes setups will have different
> >> > limits so it would be challenging to provide a general benchmark.
> >> > Another thing I found helpful in the past is to refer to
> >> > Kubernetes--for example, the default strategy is exponential for pod
> >> > restarts and we could draw inspiration from what they have set as a
> >> > general purpose default config.
> >> >
> >> > Best,
> >> > Mason
> >> >
> >> > On Sun, Nov 19, 2023 at 9:43 PM Rui Fan <[email protected]> wrote:
> >> >
> >> > > Hi David and Mason,
> >> > >
> >> > > Thanks for your feedback!
> >> > >
> >> > > To David:
> >> > >
> >> > > > Given that the new default feels more complex than the current
> >> > > > behavior, if we decide to do this I think it will be important to
> >> > > > include the rationale you've shared in the documentation.
> >> > >
> >> > > That makes sense to me. I will add the related docs if we
> >> > > update the default strategy.
> >> > >
> >> > > To Mason:
> >> > >
> >> > > > I suppose we could do some benchmarking on what works well for the
> >> > > > resource providers that Flink relies on e.g. Kubernetes. Based on
> >> > > > conferences and blogs, it seems most people are relying on
> >> > > > Kubernetes to deploy Flink and the restart strategy has a large
> >> > > > dependency on how well Kubernetes can scale to requests to
> >> > > > redeploy the job.
> >> > >
> >> > > Sorry, I didn't understand what type of benchmarking
> >> > > we should do, could you elaborate on it? Thanks a lot.
> >> > >
> >> > > Best,
> >> > > Rui
> >> > >
> >> > > On Sat, Nov 18, 2023 at 3:32 AM Mason Chen <[email protected]> wrote:
> >> > >
> >> > >> Hi Rui,
> >> > >>
> >> > >> I suppose we could do some benchmarking on what works well for the
> >> > >> resource providers that Flink relies on e.g. Kubernetes. Based on
> >> > >> conferences and blogs, it seems most people are relying on
> >> > >> Kubernetes to deploy Flink and the restart strategy has a large
> >> > >> dependency on how well Kubernetes can scale to requests to
> >> > >> redeploy the job.
> >> > >>
> >> > >> Best,
> >> > >> Mason
> >> > >>
> >> > >> On Fri, Nov 17, 2023 at 10:07 AM David Anderson <[email protected]> wrote:
> >> > >>
> >> > >>> Rui,
> >> > >>>
> >> > >>> I don't have any direct experience with this topic, but given the
> >> > >>> motivation you shared, the proposal makes sense to me. Given that
> >> > >>> the new default feels more complex than the current behavior, if we
> >> > >>> decide to do this I think it will be important to include the
> >> > >>> rationale you've shared in the documentation.
> >> > >>>
> >> > >>> David
> >> > >>>
> >> > >>> On Wed, Nov 15, 2023 at 10:17 PM Rui Fan <[email protected]> wrote:
> >> > >>>
> >> > >>>> Hi dear flink users and devs:
> >> > >>>>
> >> > >>>> FLIP-364[1] intends to make some improvements to restart-strategy.
> >> > >>>> It also discusses updating some of the default values of
> >> > >>>> exponential-delay, and whether exponential-delay can be used as
> >> > >>>> the default restart-strategy.
> >> > >>>> After discussing on the dev mailing list[2], we hope to collect
> >> > >>>> more feedback from Flink users.
> >> > >>>>
> >> > >>>> # Why does the default restart-strategy need to be updated?
> >> > >>>>
> >> > >>>> If checkpointing is enabled, the default value is fixed-delay with
> >> > >>>> Integer.MAX_VALUE restart attempts and a '1 s' delay[3]. This means
> >> > >>>> a job that keeps failing will restart indefinitely at high
> >> > >>>> frequency.
> >> > >>>>
> >> > >>>> When the Kafka cluster fails, a large number of Flink jobs will be
> >> > >>>> restarted frequently. After the Kafka cluster recovers, these
> >> > >>>> high-frequency restarts of many Flink jobs may cause the
> >> > >>>> Kafka cluster to avalanche again.
> >> > >>>>
> >> > >>>> We are considering exponential-delay as the default strategy for
> >> > >>>> a couple of reasons:
> >> > >>>>
> >> > >>>> - The exponential-delay can reduce the restart frequency when
> >> > >>>>   a job continues to fail.
> >> > >>>> - It can restart a job quickly when a job fails occasionally.
> >> > >>>> - The restart-strategy.exponential-delay.jitter-factor can avoid
> >> > >>>>   restarting multiple jobs at the same time. It’s useful to prevent
> >> > >>>>   avalanches.
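
To illustrate the jitter point, here is a rough sketch. It assumes that
jitter shifts each delay by a uniformly random offset of at most
jitter-factor times the delay in either direction; the exact formula Flink
uses may differ, and jittered_delay is a made-up name.

```python
import random


def jittered_delay(base_delay_s, jitter_factor, rng):
    """Shift base_delay_s by a random offset in [-jitter_factor, +jitter_factor] * base_delay_s.

    Assumption for illustration; not taken from Flink's source.
    """
    offset = base_delay_s * jitter_factor * rng.uniform(-1.0, 1.0)
    return base_delay_s + offset


rng = random.Random(42)  # fixed seed so the sketch is reproducible
# Five jobs that all failed at the same moment and share a 10 s base delay.
delays = [jittered_delay(10.0, 0.1, rng) for _ in range(5)]
print([round(d, 2) for d in delays])
```

With a jitter-factor of 0.1, each job waits somewhere between 9 s and 11 s
instead of exactly 10 s, so the restarts no longer land on the same instant.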
> >> > >>>>
> >> > >>>> # What are the current default values[4] of exponential-delay?
> >> > >>>>
> >> > >>>> restart-strategy.exponential-delay.initial-backoff : 1s
> >> > >>>> restart-strategy.exponential-delay.backoff-multiplier : 2.0
> >> > >>>> restart-strategy.exponential-delay.jitter-factor : 0.1
> >> > >>>> restart-strategy.exponential-delay.max-backoff : 5 min
> >> > >>>> restart-strategy.exponential-delay.reset-backoff-threshold : 1h
> >> > >>>>
> >> > >>>> backoff-multiplier=2 means that the delay time of each restart
> >> > >>>> will be doubled. The delay times are:
> >> > >>>> 1s, 2s, 4s, 8s, 16s, 32s, 64s, 128s, 256s, 300s, 300s, etc.
> >> > >>>>
> >> > >>>> The delay time increases rapidly, which affects the recovery
> >> > >>>> time of Flink jobs.
> >> > >>>>
> >> > >>>> # Option improvements
> >> > >>>>
> >> > >>>> We think the backoff-multiplier between 1 and 2 is more sensible,
> >> > >>>> such as:
> >> > >>>>
> >> > >>>> restart-strategy.exponential-delay.backoff-multiplier : 1.2
> >> > >>>> restart-strategy.exponential-delay.max-backoff : 1 min
> >> > >>>>
> >> > >>>> After updating, the delay times are:
> >> > >>>>
> >> > >>>> 1s, 1.2s, 1.44s, 1.728s, 2.073s, 2.488s, 2.985s, 3.583s, 4.299s,
> >> > >>>> 5.159s, 6.191s, 7.430s, 8.916s, 10.699s, 12.839s, 15.407s, 18.488s,
> >> > >>>> 22.186s, 26.623s, 31.948s, 38.337s, etc
> >> > >>>>
> >> > >>>> They achieve the following goals:
> >> > >>>> - When restarts are infrequent in a short period of time, Flink can
> >> > >>>>   quickly restart the job. (For example: the retry delay time when
> >> > >>>>   restarting 5 times is 2.073s)
> >> > >>>> - When restarting frequently in a short period of time, Flink can
> >> > >>>>   slightly reduce the restart frequency to prevent avalanches.
> >> > >>>>   (For example: the retry delay time when retrying 10 times is 5.1 s,
> >> > >>>>   and the retry delay time when retrying 20 times is about 32s, which
> >> > >>>>   is not very large.)
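
To put the two goals above in numbers, this sketch sums the delays over a
run of consecutive restarts for the current defaults and for the proposed
values. It assumes an initial-backoff of 1 s and no jitter, and it counts
only the waiting time, not the time the job takes to start or fail again.
The function name is made up for illustration.

```python
def total_wait_s(multiplier, max_backoff_s, restarts, initial_s=1.0):
    """Sum the delays before the first `restarts` consecutive restarts (no jitter)."""
    total = 0.0
    delay = initial_s
    for _ in range(restarts):
        # Each delay is capped at max-backoff before being added to the total.
        total += min(delay, max_backoff_s)
        delay *= multiplier
    return total


# Current defaults (multiplier 2.0, max-backoff 5 min) vs. the proposal (1.2, 1 min).
for restarts in (5, 10, 20):
    current = total_wait_s(2.0, 300.0, restarts)
    proposed = total_wait_s(1.2, 60.0, restarts)
    print(f"{restarts:2d} restarts: current {current:.1f}s, proposed {proposed:.1f}s")
```

For example, ten consecutive restarts cost 811 s of waiting with the
current defaults, versus roughly 26 s with the proposed values.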
> >> > >>>>
> >> > >>>> As @Mingliang Liu <[email protected]> mentioned on the dev mailing
> >> > >>>> list, one-size-fits-all default values do not exist. So our goal is
> >> > >>>> for the default values to be suitable for most jobs.
> >> > >>>>
> >> > >>>> Looking forward to your thoughts and feedback, thanks~
> >> > >>>>
> >> > >>>> [1] https://cwiki.apache.org/confluence/x/uJqzDw
> >> > >>>> [2] https://lists.apache.org/thread/5cgrft73kgkzkgjozf9zfk0w2oj7rjym
> >> > >>>> [3] https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/deployment/config/#restart-strategy-type
> >> > >>>> [4] https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/task_failure_recovery/#exponential-delay-restart-strategy
> >> > >>>>
> >> > >>>> Best,
> >> > >>>> Rui
> >> > >>>>
> >> > >>>
>
