Thanks everyone for the feedback! There has been no further feedback here, so I have just started a new vote[1] to update the default value of backoff-multiplier from 1.2 to 1.5.
[1] https://lists.apache.org/thread/0b1dcwb49owpm6v1j8rhrg9h0fvs5nkt

Best,
Rui

On Tue, Dec 12, 2023 at 7:14 PM Maximilian Michels <m...@apache.org> wrote:

> Thank you Rui! I think a 1.5 multiplier is a reasonable tradeoff
> between restarting fast and not putting too much pressure on the
> cluster due to restarts.
>
> -Max
>
> On Tue, Dec 12, 2023 at 8:19 AM Rui Fan <1996fan...@gmail.com> wrote:
> >
> > Hi Maximilian and Mason,
> >
> > Thanks a lot for your feedback!
> >
> > After an offline consultation with Max, I think I now understand your
> > concern: when a Flink job restarts, it makes a bunch of
> > calls to the Kubernetes API, e.g. reading/writing config maps and creating
> > task managers. Currently, the default restart strategy is fixed-delay
> > with a 1s delay, so Flink restarts jobs at high frequency
> > even when the jobs cannot be started. This can make the Kubernetes
> > cluster unstable.
> >
> > That's why I propose changing the default restart strategy to
> > exponential-delay. It achieves the following: restarts happen quickly
> > enough unless there are consecutive failures. This helps
> > the stability of external components.
> >
> > After discussing with Max and Zhu Zhu in the PR comment[1],
> > Max suggested using 1.5 as the default value of backoff-multiplier
> > instead of 1.2. A value of 1.2 is a little small (the delay time is too short).
> > This picture[2] shows the relationship between restart attempts and
> > retry delay time when backoff-multiplier is 1.2 and 1.5:
> >
> > - The delay time reaches 1 min after 12 attempts when backoff-multiplier is 1.5
> > - The delay time reaches 1 min after 24 attempts when backoff-multiplier is 1.2
> >
> > Are there any other suggestions? Looking forward to more feedback, thanks~
> >
> > BTW, as Zhu said in the comment[1], if we update the default value,
> > a new vote is needed for this default value.
> > So I will pause
> > FLINK-33736[3] first, and the rest of the JIRAs of FLIP-364 will
> > continue.
> >
> > To Mason:
> >
> > If I understand your concerns correctly, I still don't know how
> > to benchmark. The Kubernetes cluster instability only happens
> > when one cluster has a lot of jobs. In general, a test cannot
> > reproduce that pressure. Could you elaborate on how to
> > benchmark for this?
> >
> > After this FLIP, the default restart frequency will be reduced
> > significantly, especially when a job fails consecutively.
> > Do you think the benchmark is necessary?
> >
> > Looking forward to your feedback, thanks~
> >
> > [1] https://github.com/apache/flink/pull/23247#discussion_r1422626734
> > [2] https://github.com/apache/flink/assets/38427477/642c57e0-b415-4326-af05-8b506c5fbb3a
> > [3] https://issues.apache.org/jira/browse/FLINK-33736
> >
> > Best,
> > Rui
> >
> > On Thu, Dec 7, 2023 at 10:57 PM Maximilian Michels <m...@apache.org> wrote:
> >>
> >> Hey Rui,
> >>
> >> +1 for changing the default restart strategy to exponential-delay.
> >> This is something all users eventually run into. They end up changing
> >> the restart strategy to exponential-delay. I think the current
> >> defaults are quite balanced. Restarts happen quickly enough unless
> >> there are consecutive failures where I think it makes sense to double
> >> the waiting time up till the max.
> >>
> >> -Max
> >>
> >>
> >> On Wed, Dec 6, 2023 at 12:51 AM Mason Chen <mas.chen6...@gmail.com> wrote:
> >> >
> >> > Hi Rui,
> >> >
> >> > Sorry for the late reply. I was suggesting that perhaps we could do some
> >> > testing with Kubernetes wrt configuring values for the exponential restart
> >> > strategy. We've noticed that the default strategy in 1.17 caused a lot of
> >> > requests to the K8s API server for unstable deployments.
> >> >
> >> > However, people in different Kubernetes setups will have different limits
> >> > so it would be challenging to provide a general benchmark. Another thing I
> >> > found helpful in the past is to refer to Kubernetes--for example, the
> >> > default strategy is exponential for pod restarts and we could draw
> >> > inspiration from what they have set as a general purpose default config.
> >> >
> >> > Best,
> >> > Mason
> >> >
> >> > On Sun, Nov 19, 2023 at 9:43 PM Rui Fan <1996fan...@gmail.com> wrote:
> >> >
> >> > > Hi David and Mason,
> >> > >
> >> > > Thanks for your feedback!
> >> > >
> >> > > To David:
> >> > >
> >> > > > Given that the new default feels more complex than the current behavior,
> >> > > > if we decide to do this I think it will be important to include the
> >> > > > rationale you've shared in the documentation.
> >> > >
> >> > > That makes sense to me. I will add the related documentation if we
> >> > > update the default strategy.
> >> > >
> >> > > To Mason:
> >> > >
> >> > > > I suppose we could do some benchmarking on what works well for the
> >> > > > resource providers that Flink relies on e.g. Kubernetes. Based on
> >> > > > conferences and blogs,
> >> > > > it seems most people are relying on Kubernetes to deploy Flink and the
> >> > > > restart strategy has a large dependency on how well Kubernetes can scale to
> >> > > > requests to redeploy the job.
> >> > >
> >> > > Sorry, I didn't understand what type of benchmarking
> >> > > we should do. Could you elaborate on it? Thanks a lot.
> >> > >
> >> > > Best,
> >> > > Rui
> >> > >
> >> > > On Sat, Nov 18, 2023 at 3:32 AM Mason Chen <mas.chen6...@gmail.com> wrote:
> >> > >
> >> > >> Hi Rui,
> >> > >>
> >> > >> I suppose we could do some benchmarking on what works well for the
> >> > >> resource providers that Flink relies on e.g. Kubernetes.
> >> > >> Based on
> >> > >> conferences and blogs, it seems most people are relying on Kubernetes to
> >> > >> deploy Flink and the restart strategy has a large dependency on how well
> >> > >> Kubernetes can scale to requests to redeploy the job.
> >> > >>
> >> > >> Best,
> >> > >> Mason
> >> > >>
> >> > >> On Fri, Nov 17, 2023 at 10:07 AM David Anderson <dander...@apache.org> wrote:
> >> > >>
> >> > >>> Rui,
> >> > >>>
> >> > >>> I don't have any direct experience with this topic, but given the
> >> > >>> motivation you shared, the proposal makes sense to me. Given that the new
> >> > >>> default feels more complex than the current behavior, if we decide to do
> >> > >>> this I think it will be important to include the rationale you've shared in
> >> > >>> the documentation.
> >> > >>>
> >> > >>> David
> >> > >>>
> >> > >>> On Wed, Nov 15, 2023 at 10:17 PM Rui Fan <1996fan...@gmail.com> wrote:
> >> > >>>
> >> > >>>> Hi dear Flink users and devs:
> >> > >>>>
> >> > >>>> FLIP-364[1] intends to make some improvements to restart-strategy
> >> > >>>> and discusses updating some of the default values of exponential-delay,
> >> > >>>> and whether exponential-delay can be used as the default
> >> > >>>> restart-strategy.
> >> > >>>> After discussing on the dev mailing list[2], we hope to collect more feedback
> >> > >>>> from Flink users.
> >> > >>>>
> >> > >>>> # Why does the default restart-strategy need to be updated?
> >> > >>>>
> >> > >>>> If checkpointing is enabled, the default value is fixed-delay with
> >> > >>>> Integer.MAX_VALUE restart attempts and '1 s' delay[3]. It means
> >> > >>>> the job will restart infinitely with high frequency when a job
> >> > >>>> continues to fail.
> >> > >>>>
> >> > >>>> When the Kafka cluster fails, a large number of Flink jobs will be
> >> > >>>> restarted frequently.
> >> > >>>> After the Kafka cluster has recovered, a large
> >> > >>>> number of high-frequency restarts of Flink jobs may cause the
> >> > >>>> Kafka cluster to avalanche again.
> >> > >>>>
> >> > >>>> We are considering exponential-delay as the default strategy for
> >> > >>>> a couple of reasons:
> >> > >>>>
> >> > >>>> - The exponential-delay can reduce the restart frequency when
> >> > >>>>   a job continues to fail.
> >> > >>>> - It can restart a job quickly when a job fails occasionally.
> >> > >>>> - The restart-strategy.exponential-delay.jitter-factor can avoid
> >> > >>>>   restarting multiple jobs at the same time. It’s useful to prevent
> >> > >>>>   avalanches.
> >> > >>>>
> >> > >>>> # What are the current default values[4] of exponential-delay?
> >> > >>>>
> >> > >>>> restart-strategy.exponential-delay.initial-backoff : 1s
> >> > >>>> restart-strategy.exponential-delay.backoff-multiplier : 2.0
> >> > >>>> restart-strategy.exponential-delay.jitter-factor : 0.1
> >> > >>>> restart-strategy.exponential-delay.max-backoff : 5 min
> >> > >>>> restart-strategy.exponential-delay.reset-backoff-threshold : 1h
> >> > >>>>
> >> > >>>> backoff-multiplier=2 means that the delay time of each restart
> >> > >>>> will be doubled. The delay times are:
> >> > >>>> 1s, 2s, 4s, 8s, 16s, 32s, 64s, 128s, 256s, 300s, 300s, etc.
> >> > >>>>
> >> > >>>> The delay time increases rapidly, which will affect the recovery
> >> > >>>> time for Flink jobs.
> >> > >>>>
> >> > >>>> # Option improvements
> >> > >>>>
> >> > >>>> We think a backoff-multiplier between 1 and 2 is more sensible,
> >> > >>>> such as:
> >> > >>>>
> >> > >>>> restart-strategy.exponential-delay.backoff-multiplier : 1.2
> >> > >>>> restart-strategy.exponential-delay.max-backoff : 1 min
> >> > >>>>
> >> > >>>> After updating, the delay times are:
> >> > >>>>
> >> > >>>> 1s, 1.2s, 1.44s, 1.728s, 2.073s, 2.488s, 2.985s, 3.583s, 4.299s,
> >> > >>>> 5.159s, 6.191s, 7.430s, 8.916s, 10.699s, 12.839s, 15.407s, 18.488s,
> >> > >>>> 22.186s, 26.623s, 31.948s, 38.337s, etc
> >> > >>>>
> >> > >>>> They achieve the following goals:
> >> > >>>> - When restarts are infrequent in a short period of time, Flink can
> >> > >>>>   quickly restart the job. (For example: the retry delay time when
> >> > >>>>   restarting 5 times is 2.073s)
> >> > >>>> - When restarting frequently in a short period of time, Flink can
> >> > >>>>   slightly reduce the restart frequency to prevent avalanches.
> >> > >>>>   (For example: the retry delay time when retrying 10 times is 5.1 s,
> >> > >>>>   and the retry delay time when retrying 20 times is about 32 s, which is not
> >> > >>>>   very large.)
> >> > >>>>
> >> > >>>> As @Mingliang Liu <lium...@apache.org> mentioned on the dev mailing list:
> >> > >>>> one-size-fits-all default values do not exist. So our goal is that
> >> > >>>> the default values can be suitable for most jobs.
> >> > >>>>
> >> > >>>> Looking forward to your thoughts and feedback, thanks~
> >> > >>>>
> >> > >>>> [1] https://cwiki.apache.org/confluence/x/uJqzDw
> >> > >>>> [2] https://lists.apache.org/thread/5cgrft73kgkzkgjozf9zfk0w2oj7rjym
> >> > >>>> [3] https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/deployment/config/#restart-strategy-type
> >> > >>>> [4] https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/task_failure_recovery/#exponential-delay-restart-strategy
> >> > >>>>
> >> > >>>> Best,
> >> > >>>> Rui
> >> > >>>>
> >> > >>>
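
For anyone who wants to reproduce the delay sequences quoted in this thread, here is a minimal Python sketch. It is not Flink's implementation. It assumes the delay before the n-th consecutive restart is initial-backoff * backoff-multiplier^(n-1), capped at max-backoff, and it ignores reset-backoff-threshold. The function names are made up for illustration, and the jitter helper assumes a uniform offset of +/- jitter-factor around the delay, which may differ from what Flink actually does.

```python
import random


def backoff_delays(initial_s, multiplier, max_s, attempts):
    """Un-jittered delay (seconds) before each of the first `attempts` restarts.

    ASSUMPTION: delay_n = initial_s * multiplier**(n - 1), capped at max_s.
    """
    return [min(initial_s * multiplier ** n, max_s) for n in range(attempts)]


def first_attempt_at_max(initial_s, multiplier, max_s):
    """1-based restart attempt whose un-jittered delay first reaches max_s."""
    if initial_s <= 0 or multiplier <= 1:
        raise ValueError("need initial_s > 0 and multiplier > 1")
    n = 0
    while initial_s * multiplier ** n < max_s:
        n += 1
    return n + 1


def with_jitter(delay_s, jitter_factor, rng):
    """ASSUMPTION: jitter is a uniform offset of +/- jitter_factor * delay_s."""
    return delay_s * (1.0 + rng.uniform(-jitter_factor, jitter_factor))


if __name__ == "__main__":
    # (multiplier, max-backoff in seconds) pairs discussed in the thread.
    for multiplier, max_s in ((2.0, 300), (1.2, 60), (1.5, 60)):
        delays = backoff_delays(1.0, multiplier, max_s, 12)
        print(
            f"multiplier={multiplier}:",
            [round(d, 3) for d in delays],
            "-> reaches max at attempt",
            first_attempt_at_max(1.0, multiplier, max_s),
        )
    # Two jobs failing at the same moment get slightly different delays.
    rng = random.Random()
    print([round(with_jitter(10.0, 0.1, rng), 3) for _ in range(2)])
```

Under these assumptions, a 1s initial backoff with a 1 min cap reaches the cap at attempt 12 for a multiplier of 1.5 and at attempt 24 for 1.2, which matches the numbers quoted above.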