Re: [DISCUSS] Change the default restart-strategy to exponential-delay

Maximilian Michels Tue, 12 Dec 2023 03:16:00 -0800

Thank you Rui! I think a 1.5 multiplier is a reasonable tradeoff
between restarting fast and not putting too much pressure on the
cluster due to restarts.
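
To make the tradeoff concrete, here is a minimal sketch that finds the
restart attempt at which the delay first reaches the 1 min cap. It assumes
the first delay equals the 1s initial backoff, that the delay is multiplied
after every consecutive failure, and that jitter is ignored. The function
name is made up for illustration and is not Flink code.

```python
def first_capped_attempt(multiplier, initial_s=1.0, cap_s=60.0):
    """Return the 1-based restart attempt whose delay first reaches cap_s."""
    if multiplier <= 1.0:
        raise ValueError("multiplier must be greater than 1 for the delay to grow")
    attempt, delay = 1, initial_s
    while delay < cap_s:
        # Each consecutive failure multiplies the previous delay.
        delay *= multiplier
        attempt += 1
    return attempt


for m in (1.2, 1.5, 2.0):
    print(f"multiplier {m}: delay reaches the cap at attempt {first_capped_attempt(m)}")
```

Under these assumptions the result is attempt 24 for 1.2 and attempt 12 for
1.5, which matches the numbers Rui quotes below.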


-Max

On Tue, Dec 12, 2023 at 8:19 AM Rui Fan <[email protected]> wrote:
>
> Hi Maximilian and Mason,
>
> Thanks a lot for your feedback!
>
> After an offline consultation with Max, I think I now understand your
> concern: when a Flink job restarts, it makes a bunch of calls to the
> Kubernetes API, e.g. reading/writing config maps and creating task
> managers. Currently, the default restart strategy is fixed-delay with
> a 1s delay, so Flink restarts jobs at high frequency even when the
> jobs cannot be started. This can make the Kubernetes cluster
> unstable.
>
> That's why I propose changing the default restart strategy to
> exponential-delay. With it, restarts happen quickly enough unless
> there are consecutive failures, which helps the stability of
> external components.
>
> After discussing with Max and Zhu Zhu in the PR comments[1],
> Max suggested using 1.5 as the default value of backoff-multiplier
> instead of 1.2. A value of 1.2 is a little small (the delay time stays
> too short). This picture[2] shows the relationship between
> restart-attempts and retry-delay-time when backoff-multiplier is 1.2
> and 1.5:
>
> - The delay-time will reach 1 min after 12 attempts when backoff-multiplier is 1.5
> - The delay-time will reach 1 min after 24 attempts when backoff-multiplier is 1.2
>
> Is there any other suggestion? Looking forward to more feedback, thanks~
>
> BTW, as Zhu said in the comment[1], if we update the default value,
> a new vote is needed for this default value. So I will pause
> FLINK-33736[3] first, and the rest of the JIRAs of FLIP-364 will
> continue.
>
> To Mason:
>
> If I understand your concerns correctly, I still don't know how
> to benchmark this. The Kubernetes cluster instability only happens
> when one cluster runs a lot of jobs, so in general a test cannot
> reproduce the pressure. Could you elaborate on how to
> benchmark for this?
>
> After this FLIP, the default restart frequency will be reduced
> significantly, especially when a job fails consecutively.
> Do you think the benchmark is necessary?
>
> Looking forward to your feedback, thanks~
>
> [1] https://github.com/apache/flink/pull/23247#discussion_r1422626734
> [2] https://github.com/apache/flink/assets/38427477/642c57e0-b415-4326-af05-8b506c5fbb3a
> [3] https://issues.apache.org/jira/browse/FLINK-33736
>
> Best,
> Rui
>
> On Thu, Dec 7, 2023 at 10:57 PM Maximilian Michels <[email protected]> wrote:
>>
>> Hey Rui,
>>
>> +1 for changing the default restart strategy to exponential-delay.
>> This is something all users eventually run into. They end up changing
>> the restart strategy to exponential-delay. I think the current
>> defaults are quite balanced. Restarts happen quickly enough unless
>> there are consecutive failures where I think it makes sense to double
>> the waiting time up till the max.
>>
>> -Max
>>
>>
>> On Wed, Dec 6, 2023 at 12:51 AM Mason Chen <[email protected]> wrote:
>> >
>> > Hi Rui,
>> >
>> > Sorry for the late reply. I was suggesting that perhaps we could do some
>> > testing with Kubernetes wrt configuring values for the exponential restart
>> > strategy. We've noticed that the default strategy in 1.17 caused a lot of
>> > requests to the K8s API server for unstable deployments.
>> >
>> > However, people in different Kubernetes setups will have different limits
>> > so it would be challenging to provide a general benchmark. Another thing I
>> > found helpful in the past is to refer to Kubernetes--for example, the
>> > default strategy is exponential for pod restarts and we could draw
>> > inspiration from what they have set as a general purpose default config.
>> >
>> > Best,
>> > Mason
>> >
>> > On Sun, Nov 19, 2023 at 9:43 PM Rui Fan <[email protected]> wrote:
>> >
>> > > Hi David and Mason,
>> > >
>> > > Thanks for your feedback!
>> > >
>> > > To David:
>> > >
>> > > > Given that the new default feels more complex than the current
>> > > > behavior, if we decide to do this I think it will be important to
>> > > > include the rationale you've shared in the documentation.
>> > >
>> > > That makes sense to me. I will add the related docs if we
>> > > update the default strategy.
>> > >
>> > > To Mason:
>> > >
>> > > > I suppose we could do some benchmarking on what works well for the
>> > > > resource providers that Flink relies on e.g. Kubernetes. Based on
>> > > > conferences and blogs, it seems most people are relying on Kubernetes
>> > > > to deploy Flink and the restart strategy has a large dependency on how
>> > > > well Kubernetes can scale to requests to redeploy the job.
>> > >
>> > > Sorry, I didn't understand what type of benchmarking
>> > > we should do. Could you elaborate on it? Thanks a lot.
>> > >
>> > > Best,
>> > > Rui
>> > >
>> > > On Sat, Nov 18, 2023 at 3:32 AM Mason Chen <[email protected]> wrote:
>> > >
>> > >> Hi Rui,
>> > >>
>> > >> I suppose we could do some benchmarking on what works well for the
>> > >> resource providers that Flink relies on e.g. Kubernetes. Based on
>> > >> conferences and blogs, it seems most people are relying on Kubernetes to
>> > >> deploy Flink and the restart strategy has a large dependency on how well
>> > >> Kubernetes can scale to requests to redeploy the job.
>> > >>
>> > >> Best,
>> > >> Mason
>> > >>
>> > >> On Fri, Nov 17, 2023 at 10:07 AM David Anderson <[email protected]> wrote:
>> > >>
>> > >>> Rui,
>> > >>>
>> > >>> I don't have any direct experience with this topic, but given the
>> > >>> motivation you shared, the proposal makes sense to me. Given that the
>> > >>> new default feels more complex than the current behavior, if we decide
>> > >>> to do this I think it will be important to include the rationale you've
>> > >>> shared in the documentation.
>> > >>>
>> > >>> David
>> > >>>
>> > >>> On Wed, Nov 15, 2023 at 10:17 PM Rui Fan <[email protected]> wrote:
>> > >>>
>> > >>>> Hi dear flink users and devs:
>> > >>>>
>> > >>>> FLIP-364[1] intends to make some improvements to restart-strategy
>> > >>>> and discuss updating some of the default values of exponential-delay,
>> > >>>> and whether exponential-delay can be used as the default
>> > >>>> restart-strategy.
>> > >>>> After discussing on the dev mailing list[2], we hope to collect more feedback
>> > >>>> from Flink users.
>> > >>>>
>> > >>>> # Why does the default restart-strategy need to be updated?
>> > >>>>
>> > >>>> If checkpointing is enabled, the default value is fixed-delay with
>> > >>>> Integer.MAX_VALUE restart attempts and '1 s' delay[3]. This means
>> > >>>> a job that keeps failing will restart indefinitely at high
>> > >>>> frequency.
>> > >>>>
>> > >>>> When a Kafka cluster fails, a large number of Flink jobs will be
>> > >>>> restarted frequently. After the Kafka cluster recovers, the
>> > >>>> high-frequency restarts of so many Flink jobs may cause the
>> > >>>> Kafka cluster to avalanche again.
>> > >>>>
>> > >>>> We are considering exponential-delay as the default strategy for
>> > >>>> a couple of reasons:
>> > >>>>
>> > >>>> - The exponential-delay can reduce the restart frequency when
>> > >>>>   a job continues to fail.
>> > >>>> - It can restart a job quickly when a job fails occasionally.
>> > >>>> - The restart-strategy.exponential-delay.jitter-factor can avoid
>> > >>>>   restarting multiple jobs at the same time. It's useful to prevent
>> > >>>>   avalanches.
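
As an illustration of the jitter point above, here is a minimal sketch. It
assumes the jitter is a uniform random offset of at most jitter-factor times
the current delay, in either direction. The thread does not spell out the
exact formula, so treat this as an assumption rather than Flink's
implementation. The function name is hypothetical.

```python
import random


def jittered_delay(base_delay_s, jitter_factor=0.1, rng=random):
    """Return base_delay_s shifted by a random offset within +/- jitter_factor."""
    # Assumed model: offset is drawn uniformly from [-f * delay, +f * delay].
    offset = base_delay_s * jitter_factor
    return base_delay_s + rng.uniform(-offset, offset)


# Jobs that failed at the same moment no longer restart at the same moment.
rng = random.Random(42)
print([round(jittered_delay(10.0, 0.1, rng), 3) for _ in range(5)])
```

With a factor of 0.1, a 10s delay becomes some value between 9s and 11s, so
jobs that failed together spread their restarts over a 2s window.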
>> > >>>>
>> > >>>> # What are the current default values[4] of exponential-delay?
>> > >>>>
>> > >>>> restart-strategy.exponential-delay.initial-backoff : 1s
>> > >>>> restart-strategy.exponential-delay.backoff-multiplier : 2.0
>> > >>>> restart-strategy.exponential-delay.jitter-factor : 0.1
>> > >>>> restart-strategy.exponential-delay.max-backoff : 5 min
>> > >>>> restart-strategy.exponential-delay.reset-backoff-threshold : 1h
>> > >>>>
>> > >>>> backoff-multiplier=2 means that the delay time of each restart
>> > >>>> will be doubled. The delay times are:
>> > >>>> 1s, 2s, 4s, 8s, 16s, 32s, 64s, 128s, 256s, 300s, 300s, etc.
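
The sequence above can be reproduced with a short sketch. It assumes the
first delay is the initial backoff, each later delay is the previous one
times the multiplier, and the result is clamped to max-backoff. Jitter and
reset-backoff-threshold are left out. The function and parameter names are
illustrative only.

```python
def backoff_schedule(attempts, initial_s=1.0, multiplier=2.0, max_backoff_s=300.0):
    """Return the delay in seconds before each of the first `attempts` restarts."""
    delays = []
    delay = initial_s
    for _ in range(attempts):
        # Clamp to the configured maximum before recording the delay.
        delays.append(min(delay, max_backoff_s))
        delay *= multiplier
    return delays


print(backoff_schedule(11))
```

Passing a different multiplier and max_backoff_s gives the schedule for any
other candidate setting under the same assumptions.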
>> > >>>>
>> > >>>> The delay time increases rapidly, which will affect the recovery
>> > >>>> time of Flink jobs.
>> > >>>>
>> > >>>> # Option improvements
>> > >>>>
>> > >>>> We think a backoff-multiplier between 1 and 2 is more sensible,
>> > >>>> such as:
>> > >>>>
>> > >>>> restart-strategy.exponential-delay.backoff-multiplier : 1.2
>> > >>>> restart-strategy.exponential-delay.max-backoff : 1 min
>> > >>>>
>> > >>>> After updating, the delay times are:
>> > >>>>
>> > >>>> 1s, 1.2s, 1.44s, 1.728s, 2.073s, 2.488s, 2.985s, 3.583s, 4.299s,
>> > >>>> 5.159s, 6.191s, 7.430s, 8.916s, 10.699s, 12.839s, 15.407s, 18.488s,
>> > >>>> 22.186s, 26.623s, 31.948s, 38.337s, etc
>> > >>>>
>> > >>>> They achieve the following goals:
>> > >>>> - When restarts are infrequent in a short period of time, flink can
>> > >>>>   quickly restart the job. (For example: the retry delay time when
>> > >>>>   restarting 5 times is 2.073s)
>> > >>>> - When restarting frequently in a short period of time, flink can
>> > >>>>   slightly reduce the restart frequency to prevent avalanches.
>> > >>>>   (For example: the retry delay time when retrying 10 times is 5.1 s,
>> > >>>>   and the retry delay time when retrying 20 times is about 32s, which
>> > >>>>   is not very large.)
>> > >>>>
>> > >>>> As @Mingliang Liu <[email protected]> mentioned on the dev mailing list,
>> > >>>> one-size-fits-all default values do not exist. So our goal is that the
>> > >>>> default values can be suitable for most jobs.
>> > >>>>
>> > >>>> Looking forward to your thoughts and feedback, thanks~
>> > >>>>
>> > >>>> [1] https://cwiki.apache.org/confluence/x/uJqzDw
>> > >>>> [2] https://lists.apache.org/thread/5cgrft73kgkzkgjozf9zfk0w2oj7rjym
>> > >>>> [3] https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/deployment/config/#restart-strategy-type
>> > >>>> [4] https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/task_failure_recovery/#exponential-delay-restart-strategy
>> > >>>>
>> > >>>> Best,
>> > >>>> Rui
>> > >>>>
>> > >>>
