Hi Zhu and Matthias:

> 3. failure counting
> Flink currently will try to recognize concurrent failures and group them
> together, which can be seen in the web UI. So how about aligning the
> failure counting with the concurrent failures computing? This can make it
> more consistent and easier to understand. It will require changes to
> the concurrent failures computing though, i.e. taking the backoff time
> into consideration. So maybe we can open a separate FLIP for this change.

I recently analyzed concurrentExceptions in detail and double-checked
with Matthias, who is the contributor of the exception history. We found
that concurrentExceptions doesn't work: it's always empty in production.
I created FLINK-33565[1] to track it.

To Zhu: After discussing with Matthias, we prefer to handle it as a
separate JIRA. FLIP-364 doesn't include it because it's a separate bug.
WDYT?

Thanks Zhu for mentioning concurrentExceptions, and thanks Matthias for
helping double-check.

[1] https://issues.apache.org/jira/browse/FLINK-33565

Best,
Rui

On Thu, Nov 16, 2023 at 11:39 AM Rui Fan <1996fan...@gmail.com> wrote:

> Hi Zhu, Jing and Mingliang:
>
> Thanks for your feedback about considering exponential-delay
> as the default restart-strategy, and updating the default
> values of exponential-delay as well. I have started a
> discussion on the user, user-zh and dev mailing lists about it[1].
>
> [1] https://lists.apache.org/thread/6glz0d57r8gtpzq4f71vf9066c5x6nyw
>
> Best,
> Rui
>
> On Thu, Nov 16, 2023 at 6:35 AM Mingliang Liu <lium...@apache.org> wrote:
>
>> Thanks for sharing your data points.
>>
>> Among a few thousand jobs (from the smallest 1 task manager and the
>> largest 300+ task managers), I presume most of them use the default.
>> However, the default values we have been using were not broadly discussed
>> but instead based on a priori knowledge as we manage many jobs for our
>> (internal) customers. So I believe it's a good idea to engage with user ML
>> for more feedback. Currently we rely on the two explicit configs:
>>
>>> restart-strategy.exponential-delay.initial-backoff: 5 s
>>> restart-strategy.exponential-delay.max-backoff: 2 min
>>
>>
>> I think the default values in the FLIP look good to me overall, though I
>> completely understand that the one-size-fits-all default values do not
>> exist. Specifically, a multiplier value between 1 and 2 is more sensible to
>> me than the existing value 2, if we enable exponential backoff as the
>> default.
>> The proposed value 1.2 is in this range. Jitter-factor being 0.1
>> and reset threshold being 1h are both the same as existing values.
>>
>> One question is the max attempts. Is that the max attempt after which the
>> job will be deemed failed? I'm wondering if we just simplify the name from
>> `max-attempts-before-reset-backoff` to `max-attempts` or just `attempts`
>> (like the static strategy `restart-strategy.fixed-delay.attempts`). The
>> wording `before-reset-backoff` makes me think it's setting the backoff
>> interval to its initial value after this max attempt, instead of failing
>> the job.
>>
>> On Tue, Nov 14, 2023 at 8:16 PM Rui Fan <1996fan...@gmail.com> wrote:
>>
>>> Hi Mingliang:
>>>
>>> Thank you for the feedback here!
>>>
>>> Glad to hear Netflix has made exponential-delay the
>>> default restart strategy. Our production (Shopee) has also used
>>> exponential-delay as the default since May 2021, and the
>>> current number of Flink jobs far exceeds tens of thousands.
>>> These jobs work well.
>>>
>>> Note: Our internal exponential-delay solves the problem
>>> of a large number of tasks failing in a short period of time
>>> causing restartAttempts to increase rapidly.
>>>
>>> Based on your production experience, do you have any suggestions
>>> about default values of the exponential-delay configuration?
>>>
>>> Zhu and Jing may also be interested in this question.
>>>
>>> Following are the FLIP-364 proposed default values:
>>>
>>> restart-strategy.exponential-delay.max-attempts-before-reset-backoff : Integer.MAX_VALUE
>>> restart-strategy.exponential-delay.initial-backoff : 1s
>>> restart-strategy.exponential-delay.backoff-multiplier : 1.2
>>> restart-strategy.exponential-delay.jitter-factor : 0.1
>>> restart-strategy.exponential-delay.max-backoff : 1 min
>>> restart-strategy.exponential-delay.reset-backoff-threshold : 1h
>>>
>>> Looking forward to your feedback! And I will start a discussion
>>> on the user mailing list to collect more feedback.
>>>
>>> In addition, I understand that the community needs to consider
>>> a lot of compatibility and risks when modifying the default value.
>>> If it is very difficult to reach consensus on this, I can remove
>>> this item from the FLIP.
>>>
>>> Best,
>>> Rui
>>>
>>> On Wed, Nov 15, 2023 at 6:40 AM Mingliang Liu <lium...@apache.org> wrote:
>>>
>>>> Thanks Rui for driving this. I just want to call out that making
>>>> exponential-delay the default is a good change. At Netflix, we
>>>> enabled this as the default restart strategy 2 quarters ago and it
>>>> has been working well. Keeping it restarting indefinitely by default
>>>> makes sense to me.
>>>>
>>>> On Mon, Oct 16, 2023 at 10:11 PM Rui Fan <1996fan...@gmail.com> wrote:
>>>>
>>>> > Hi all,
>>>> >
>>>> > I would like to start a discussion on FLIP-364: Improve the
>>>> > restart-strategy[1]
>>>> >
>>>> > As we know, the restart-strategy is critical for Flink jobs. It mainly
>>>> > has two functions:
>>>> > 1. When an exception occurs in the Flink job, quickly restart the job
>>>> > so that the job can return to the running state.
>>>> > 2. When a job cannot be recovered after frequent restarts within
>>>> > a certain period of time, Flink will not retry but will fail the job.
>>>> >
>>>> > The current restart-strategy support for function 2 has some issues:
>>>> > 1. The exponential-delay doesn't have a max attempts mechanism,
>>>> > which means that Flink will restart indefinitely even if the job
>>>> > fails frequently.
>>>> > 2. For multi-region streaming jobs and all batch jobs, the failure of
>>>> > each region will increase the total number of job failures by 1,
>>>> > even if these failures occur at the same time. If the number of
>>>> > failures increases too quickly, it will be difficult to set a
>>>> > reasonable number of retries.
>>>> > If the maximum number of failures is set too low, the job can easily
>>>> > reach the retry limit, causing the job to fail.
>>>> > If set too high, some jobs
>>>> > will never fail.
>>>> >
>>>> > In addition, when the above two problems are solved, we can also
>>>> > discuss whether exponential-delay can replace fixed-delay as the
>>>> > default restart-strategy. In theory, exponential-delay is smarter and
>>>> > friendlier than fixed-delay.
>>>> >
>>>> > I also thank Zhu Zhu in advance for his suggestions on the option
>>>> > name in FLINK-32895[2].
>>>> >
>>>> > Looking forward to everyone's feedback and suggestions,
>>>> > thank you.
>>>> >
>>>> > [1] https://cwiki.apache.org/confluence/x/uJqzDw
>>>> > [2] https://issues.apache.org/jira/browse/FLINK-32895
>>>> >
>>>> > Best,
>>>> > Rui
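
To make the default values discussed in this thread easier to compare, here is a rough sketch of the nominal delay schedule that an exponential backoff produces for a given initial-backoff, backoff-multiplier and max-backoff. It is not Flink's implementation: the helper names `backoff_schedule` and `restarts_until_capped` are made up for illustration, jitter-factor and reset-backoff-threshold are deliberately ignored, and the second example assumes the two explicit values from Mingliang's message are combined with the existing multiplier of 2.

```python
def backoff_schedule(initial_s, multiplier, max_s, attempts):
    """Nominal delay in seconds before each of `attempts` consecutive restarts.

    Illustrative only: jitter-factor and reset-backoff-threshold are ignored.
    """
    delays = []
    delay = initial_s
    for _ in range(attempts):
        # Each delay is capped at max-backoff.
        delays.append(min(delay, max_s))
        delay *= multiplier
    return delays


def restarts_until_capped(initial_s, multiplier, max_s):
    """Number of consecutive restarts whose delay is still below max-backoff."""
    count = 0
    delay = initial_s
    while delay < max_s:
        delay *= multiplier
        count += 1
    return count


# FLIP-364 proposed defaults quoted in this thread: 1s, 1.2, 1 min.
proposed = backoff_schedule(1.0, 1.2, 60.0, 30)

# Assumption: explicit 5 s / 2 min values combined with the existing multiplier 2.
explicit = backoff_schedule(5.0, 2.0, 120.0, 8)

print([round(d, 2) for d in proposed[:5]])
print(explicit)
print(restarts_until_capped(1.0, 1.2, 60.0))
```

Under these assumptions, a multiplier of 1.2 climbs towards max-backoff far more slowly than a multiplier of 2, which is one way to read the preference for a value between 1 and 2.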