Re: [DISCUSS] Change the default restart-strategy to exponential-delay

Mason Chen Tue, 05 Dec 2023 15:51:34 -0800

Hi Rui,

Sorry for the late reply. I was suggesting that perhaps we could do some
testing with Kubernetes wrt configuring values for the exponential restart
strategy. We've noticed that the default strategy in 1.17 caused a lot of
requests to the K8s API server for unstable deployments.


However, people in different Kubernetes setups will have different limits,
so it would be challenging to provide a general benchmark. Another approach
I have found helpful is to look at Kubernetes itself: for example, its
default backoff for pod restarts is exponential, and we could draw
inspiration from the values it uses as a general-purpose default config.

Best,
Mason

On Sun, Nov 19, 2023 at 9:43 PM Rui Fan <[email protected]> wrote:

> Hi David and Mason,
>
> Thanks for your feedback!
>
> To David:
>
> > Given that the new default feels more complex than the current behavior,
> > if we decide to do this I think it will be important to include the
> > rationale you've shared in the documentation.
>
> That makes sense to me. I will add the related documentation if we
> update the default strategy.
>
> To Mason:
>
> > I suppose we could do some benchmarking on what works well for the
> > resource providers that Flink relies on e.g. Kubernetes. Based on
> > conferences and blogs, it seems most people are relying on Kubernetes to
> > deploy Flink and the restart strategy has a large dependency on how well
> > Kubernetes can scale to requests to redeploy the job.
>
> Sorry, I didn't understand what type of benchmarking
> we should do. Could you elaborate on it? Thanks a lot.
>
> Best,
> Rui
>
> On Sat, Nov 18, 2023 at 3:32 AM Mason Chen <[email protected]> wrote:
>
>> Hi Rui,
>>
>> I suppose we could do some benchmarking on what works well for the
>> resource providers that Flink relies on e.g. Kubernetes. Based on
>> conferences and blogs, it seems most people are relying on Kubernetes to
>> deploy Flink and the restart strategy has a large dependency on how well
>> Kubernetes can scale to requests to redeploy the job.
>>
>> Best,
>> Mason
>>
>> On Fri, Nov 17, 2023 at 10:07 AM David Anderson <[email protected]>
>> wrote:
>>
>>> Rui,
>>>
>>> I don't have any direct experience with this topic, but given the
>>> motivation you shared, the proposal makes sense to me. Given that the new
>>> default feels more complex than the current behavior, if we decide to do
>>> this I think it will be important to include the rationale you've shared in
>>> the documentation.
>>>
>>> David
>>>
>>> On Wed, Nov 15, 2023 at 10:17 PM Rui Fan <[email protected]> wrote:
>>>
>>>> Hi dear flink users and devs:
>>>>
>>>> FLIP-364 [1] proposes some improvements to the restart-strategy
>>>> options. It also discusses updating some of the default values of
>>>> exponential-delay, and whether exponential-delay can be used as the
>>>> default restart-strategy.
>>>> After discussing this on the dev mailing list [2], we hope to collect
>>>> more feedback from Flink users.
>>>>
>>>> # Why does the default restart-strategy need to be updated?
>>>>
>>>> If checkpointing is enabled, the default value is fixed-delay with
>>>> Integer.MAX_VALUE restart attempts and '1 s' delay [3]. This means
>>>> that a job which keeps failing will restart indefinitely and at
>>>> high frequency.
>>>>
>>>> When a Kafka cluster fails, a large number of Flink jobs will be
>>>> restarted frequently. After the Kafka cluster has recovered, the
>>>> high-frequency restarts of that many Flink jobs may cause the
>>>> Kafka cluster to avalanche again.
>>>>
>>>> We are considering exponential-delay as the default strategy for
>>>> a couple of reasons:
>>>>
>>>> - The exponential-delay strategy can reduce the restart frequency
>>>>   when a job continues to fail.
>>>> - It can restart a job quickly when a job fails only occasionally.
>>>> - The restart-strategy.exponential-delay.jitter-factor can avoid
>>>>   restarting multiple jobs at the same time. It’s useful to prevent
>>>>   avalanches.
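To make the jitter point concrete, here is a minimal Python sketch. It assumes the jitter is a uniform random offset of at most jitter-factor times the delay in either direction; that formula and the name `jittered_delay` are assumptions for illustration, not Flink's actual implementation.

```python
import random


def jittered_delay(base_s, jitter_factor, rng=random):
    """Spread a delay uniformly within +/- jitter_factor of its base value.

    Hypothetical helper: the uniform offset is an assumption, not Flink's code.
    """
    offset = base_s * jitter_factor
    return base_s + rng.uniform(-offset, offset)


# Five jobs that would all restart after 8s now restart at slightly
# different times. The seed only makes the run repeatable.
rng = random.Random(42)
delays = [jittered_delay(8.0, 0.1, rng) for _ in range(5)]
print([round(d, 3) for d in delays])
```

With a factor of 0.1, every value falls between 7.2s and 8.8s, so jobs that failed together no longer retry at the same instant.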
>>>>
>>>> # What are the current default values[4] of exponential-delay?
>>>>
>>>> restart-strategy.exponential-delay.initial-backoff : 1s
>>>> restart-strategy.exponential-delay.backoff-multiplier : 2.0
>>>> restart-strategy.exponential-delay.jitter-factor : 0.1
>>>> restart-strategy.exponential-delay.max-backoff : 5 min
>>>> restart-strategy.exponential-delay.reset-backoff-threshold : 1h
>>>>
>>>> backoff-multiplier=2 means that the delay time of each restart
>>>> will be doubled. The delay times are:
>>>> 1s, 2s, 4s, 8s, 16s, 32s, 64s, 128s, 256s, 300s, 300s, etc.
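The schedule above can be reproduced with a short Python sketch. It ignores jitter and reset-backoff-threshold, and `backoff_schedule` is a hypothetical helper written for this illustration, not a Flink API.

```python
def backoff_schedule(initial_s, multiplier, max_s, attempts):
    """Return the first `attempts` restart delays in seconds, without jitter."""
    delays = []
    delay = initial_s
    for _ in range(attempts):
        # Each delay is capped at max-backoff before it is used.
        delays.append(min(delay, max_s))
        delay *= multiplier
    return delays


# Current defaults: initial-backoff 1s, backoff-multiplier 2.0, max-backoff 5 min.
current_defaults = backoff_schedule(1.0, 2.0, 300.0, attempts=11)
print(current_defaults)
# prints [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0, 256.0, 300.0, 300.0]
```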
>>>>
>>>> The delay time increases rapidly, which lengthens the recovery
>>>> time for Flink jobs.
>>>>
>>>> # Option improvements
>>>>
>>>> We think the backoff-multiplier between 1 and 2 is more sensible,
>>>> such as:
>>>>
>>>> restart-strategy.exponential-delay.backoff-multiplier : 1.2
>>>> restart-strategy.exponential-delay.max-backoff : 1 min
>>>>
>>>> After updating, the delay times are:
>>>>
>>>> 1s, 1.2s, 1.44s, 1.728s, 2.073s, 2.488s, 2.985s, 3.583s, 4.299s,
>>>> 5.159s, 6.191s, 7.430s, 8.916s, 10.699s, 12.839s, 15.407s, 18.488s,
>>>> 22.186s, 26.623s, 31.948s, 38.337s, etc
>>>>
>>>> They achieve the following goals:
>>>> - When restarts are infrequent in a short period of time, Flink can
>>>>   quickly restart the job. (For example: the delay before the 5th
>>>>   restart is 2.073s.)
>>>> - When restarting frequently in a short period of time, Flink can
>>>>   slightly reduce the restart frequency to prevent avalanches.
>>>>   (For example: the delay before the 10th restart is about 5.2s, and
>>>>   the delay before the 20th restart is about 32s, which is not very
>>>>   large.)
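As a rough illustration of the trade-off, the following Python sketch compares the jitter-free schedules for the current values (multiplier 2.0, max-backoff 5 min) and the suggested ones (multiplier 1.2, max-backoff 1 min). The helper `backoff_schedule` is hypothetical, and jitter and reset-backoff-threshold are left out for simplicity.

```python
def backoff_schedule(initial_s, multiplier, max_s, attempts):
    """Return the first `attempts` restart delays in seconds, without jitter."""
    delays = []
    delay = initial_s
    for _ in range(attempts):
        delays.append(min(delay, max_s))
        delay *= multiplier
    return delays


current = backoff_schedule(1.0, 2.0, 300.0, attempts=30)
proposed = backoff_schedule(1.0, 1.2, 60.0, attempts=30)

# Delay before the 5th and 10th restart under the suggested values.
print(f"5th: {proposed[4]:.3f}s, 10th: {proposed[9]:.3f}s")

# Total time spent waiting across the first 10 restarts.
print(f"current: {sum(current[:10]):.2f}s, proposed: {sum(proposed[:10]):.2f}s")

# First restart (1-based) whose delay reaches the 1 min cap.
first_capped = next(i for i, d in enumerate(proposed, start=1) if d >= 60.0)
print(first_capped)
```

Under these assumptions, ten consecutive restarts wait about 26s in total with the suggested values versus 811s with the current ones, and the 1 min cap is first reached at the 24th restart.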
>>>>
>>>> As @Mingliang Liu <[email protected]> mentioned on the dev mailing
>>>> list, one-size-fits-all default values do not exist. So our goal is
>>>> for the default values to be suitable for most jobs.
>>>>
>>>> Looking forward to your thoughts and feedback, thanks~
>>>>
>>>> [1] https://cwiki.apache.org/confluence/x/uJqzDw
>>>> [2] https://lists.apache.org/thread/5cgrft73kgkzkgjozf9zfk0w2oj7rjym
>>>> [3] https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/deployment/config/#restart-strategy-type
>>>> [4] https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/task_failure_recovery/#exponential-delay-restart-strategy
>>>>
>>>> Best,
>>>> Rui
>>>>
>>>
