[DISCUSS] FLIP-364: Improve the restart-strategy

2023-10-16 Thread Rui Fan
Hi all, I would like to start a discussion on FLIP-364: Improve the restart-strategy[1]. As we know, the restart-strategy is critical for Flink jobs. It mainly has two functions: 1. When an exception occurs in the Flink job, quickly restart the job so that the job can return to the running state.

Re: [DISCUSS] FLIP-364: Improve the restart-strategy

2023-10-19 Thread Konstantin Knauf
Hi Rui, Thank you for this proposal and for working on this. I agree that exponential backoff makes sense as a new default in general. I also think restarting indefinitely (no max attempts) makes sense by default, though of course allowing users to change this is valuable. So, overall +1. Cheers,

Re: [DISCUSS] FLIP-364: Improve the restart-strategy

2023-10-19 Thread Maximilian Michels
Hey Rui, +1 for making exponential backoff the default. I agree with Konstantin that retrying forever is a good default for exponential backoff because oftentimes the issue will resolve eventually. The purpose of exponential backoff is precisely to continue to retry without causing too much load.
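As a rough illustration of the point above (retry forever while keeping restart load bounded), here is a minimal sketch of an unbounded exponential backoff schedule. The function name `backoff_delays` and all parameter values are hypothetical and chosen for illustration only; they are not Flink's implementation or its default settings.

```python
import itertools
import random


def backoff_delays(initial=1.0, multiplier=2.0, max_delay=60.0, jitter=0.0, rng=None):
    """Yield restart delays in seconds, forever.

    All defaults here are illustrative assumptions, not Flink defaults.
    """
    rng = rng or random.Random()
    delay = initial
    while True:
        # Optionally spread restarts by +/- `jitter` (a fraction of the delay).
        offset = delay * jitter * rng.uniform(-1.0, 1.0) if jitter else 0.0
        yield max(0.0, delay + offset)
        # Grow the delay geometrically, but never beyond the cap.
        delay = min(delay * multiplier, max_delay)


# First eight delays without jitter: the delay doubles until it hits the cap.
first_eight = list(itertools.islice(backoff_delays(), 8))
print(first_eight)
```

Because the delay is capped rather than the attempt count, the generator never stops: a job keeps retrying, but at most once per `max_delay` seconds once the cap is reached.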

Re: [DISCUSS] FLIP-364: Improve the restart-strategy

2023-10-19 Thread Rui Fan
Hi Konstantin and Max, Thanks for your feedback! Sorry, I forgot to mention the default value of `restart-strategy.exponential-delay.max-attempts-before-reset-backoff`. Retrying forever sounds good to me, I have added it to the FLIP: The default value of `restart-strategy.exponential-delay.max-

Re: [DISCUSS] FLIP-364: Improve the restart-strategy

2023-11-10 Thread Rui Fan
I'll start voting next Monday if there isn't any other comment. Best, Rui

Re: [DISCUSS] FLIP-364: Improve the restart-strategy

2023-11-13 Thread Zhu Zhu
Hi Rui, Thanks for creating this FLIP and sorry for jumping into the discussion so late. The improvements to the exponential-delay strategy and making it the default strategy look good to me in general. I have some comments on it, as well as on the failure counting. 1. default values of exponen

Re: [DISCUSS] FLIP-364: Improve the restart-strategy

2023-11-13 Thread Jing Ge
Hi Rui, Thanks for the proposal! I agree with Zhu that any change to the default behavior will have an impact on users' jobs in production environments, and it would be necessary to draw users' attention to it to avoid any surprises after upgrading Flink. @Zhu for 1, if we change the default values

Re: [DISCUSS] FLIP-364: Improve the restart-strategy

2023-11-14 Thread Rui Fan
Thanks a lot Zhu and Jing for the comments! Regarding the concurrent failures mentioned by Zhu, I was not familiar with them before and need some time to look into them, so I will reply to those points later. I will give Jing an answer first: > NIT: @Rui it would be great if you could point out the sourc

Re: [DISCUSS] FLIP-364: Improve the restart-strategy

2023-11-14 Thread Jing Ge
awesome! @Rui Thanks for your effort! Appreciate it! Best regards, Jing

Re: [DISCUSS] FLIP-364: Improve the restart-strategy

2023-11-14 Thread Mingliang Liu
Thanks Rui for driving this. I just want to call out that making exponential-delay the default is a good change. At Netflix, we enabled this as the default restart strategy two quarters ago and it has been working well. Keeping it restarting indefinitely by default makes sense to me. On Mon, Oct 16, 20

Re: [DISCUSS] FLIP-364: Improve the restart-strategy

2023-11-14 Thread Rui Fan
Hi Mingliang: Thank you for the feedback here! Glad to hear Netflix has made exponential-delay the default restart strategy. Our production environment (Shopee) has also used exponential-delay as the default since May 2021, and the current number of Flink jobs far exceeds tens of thousands. These jobs work

Re: [DISCUSS] FLIP-364: Improve the restart-strategy

2023-11-15 Thread Mingliang Liu
Thanks for sharing your data points. Among a few thousand jobs (ranging from the smallest with 1 task manager to the largest with 300+ task managers), I presume most of them use the default. However, the default values we have been using were not broadly discussed but instead based on a priori knowledge as we mana

Re: [DISCUSS] FLIP-364: Improve the restart-strategy

2023-11-15 Thread Rui Fan
Hi Zhu, Jing and Mingliang: Thanks for your feedback about making exponential-delay the default restart-strategy and updating the default values of exponential-delay as well. I have started a discussion about it on the user, user-zh and dev mailing lists[1]. [1] https://lists.apache.org/thread/6glz

Re: [DISCUSS] FLIP-364: Improve the restart-strategy

2023-11-15 Thread Rui Fan
Hi Zhu and Matthias: > 3. failure counting > Flink currently will try to recognize concurrent failures and group them > together, which can be seen in the web UI. So how about to align the > failure counting with the concurrent failures computing? This can make it > more consistent and easier for

Re: [DISCUSS] FLIP-364: Improve the restart-strategy

2023-11-16 Thread Rui Fan
Hi all, Zhu and I had an offline discussion today. We prefer that this FLIP focus on improving exponential-delay and on using exponential-delay as the default strategy. It means this FLIP doesn't include improvements related to fixed-delay and failover-delay, and the second part of the FLIP (Improve restartAt

Re: [DISCUSS] FLIP-364: Improve the restart-strategy

2023-11-16 Thread Mingliang Liu
Thank you Rui. It makes sense to me now.

Re: [DISCUSS] FLIP-364: Improve the restart-strategy

2023-11-29 Thread Rui Fan
Hi all, The user mailing list thread[1] has been open for 13 days and has collected one useful suggestion. > Given that the new default feels more complex than the current behavior, if we decide to do this I think it will be important to include the rationale you've shared in the documentation. I will add the re