[DISCUSS] FLIP-364: Improve the restart-strategy

Rui Fan Mon, 16 Oct 2023 22:11:59 -0700

Hi all,

I would like to start a discussion on FLIP-364: Improve the
restart-strategy [1].


As we know, the restart-strategy is critical for Flink jobs. It mainly
has two functions:
1. When an exception occurs in a Flink job, restart the job quickly
so that it can return to the running state.
2. When a job cannot recover after frequent restarts within a certain
period of time, stop retrying and fail the job.

The current restart-strategy support for function 2 has some issues:
1. The exponential-delay strategy has no max-attempts mechanism, which
means that Flink will restart indefinitely even if the job fails frequently.
2. For multi-region streaming jobs and all batch jobs, the failure of
each region increases the total number of job failures by 1, even if
these failures occur at the same time. Because the failure count can
grow this quickly, it is difficult to set a reasonable number of retries:
if the maximum number of failures is set too low, the job easily reaches
the retry limit and fails; if it is set too high, some jobs will never fail.
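
To make the two issues concrete, here is a minimal Java sketch of an
exponential backoff with an attempt cap that also counts failures
arriving close together as a single attempt. It is only an illustration
under stated assumptions, not Flink's restart-strategy API and not
necessarily the design in the FLIP: the class RestartBackoffSketch, the
maxAttempts cap, and the mergeWindowMs window are all hypothetical names
introduced for this example.

```java
/**
 * Hypothetical sketch, not Flink's restart-strategy API.
 * Illustrates (1) capping the attempts of an exponential backoff and
 * (2) counting failures that arrive close together as one attempt.
 */
class RestartBackoffSketch {
    private final long initialBackoffMs;
    private final long maxBackoffMs;
    private final double multiplier;
    private final int maxAttempts;    // assumed cap on counted attempts
    private final long mergeWindowMs; // assumed window for merging failures

    private int attempts = 0;
    private boolean sawFailure = false;
    private long windowStartMs = 0;

    RestartBackoffSketch(long initialBackoffMs, long maxBackoffMs, double multiplier,
                         int maxAttempts, long mergeWindowMs) {
        this.initialBackoffMs = initialBackoffMs;
        this.maxBackoffMs = maxBackoffMs;
        this.multiplier = multiplier;
        this.maxAttempts = maxAttempts;
        this.mergeWindowMs = mergeWindowMs;
    }

    /** Records a failure; returns true only if it was counted as a new attempt. */
    boolean notifyFailure(long nowMs) {
        if (sawFailure && nowMs - windowStartMs < mergeWindowMs) {
            return false; // e.g. a second region failing at almost the same time
        }
        sawFailure = true;
        windowStartMs = nowMs;
        attempts++;
        return true;
    }

    /** False once more attempts were counted than the cap allows. */
    boolean canRestart() {
        return attempts <= maxAttempts;
    }

    /** Delay before the next restart: initial * multiplier^(attempts - 1), capped. */
    long backoffMs() {
        int exponent = Math.max(attempts - 1, 0);
        double raw = initialBackoffMs * Math.pow(multiplier, exponent);
        return (long) Math.min(raw, (double) maxBackoffMs);
    }

    int attempts() {
        return attempts;
    }

    public static void main(String[] args) {
        RestartBackoffSketch s = new RestartBackoffSketch(1_000, 60_000, 2.0, 3, 500);
        // Three region failures at nearly the same time, then three isolated ones.
        long[] failureTimesMs = {0, 100, 200, 5_000, 20_000, 50_000};
        for (long t : failureTimesMs) {
            boolean counted = s.notifyFailure(t);
            System.out.printf("t=%d counted=%b attempts=%d backoff=%dms canRestart=%b%n",
                    t, counted, s.attempts(), s.backoffMs(), s.canRestart());
        }
    }
}
```

In this sketch the three failures at 0, 100 and 200 ms count as one
attempt, so a cap of 3 is reached only by the later, separate failures.
Without the merge window the same burst would use up all three attempts
at once, which is the difficulty described in issue 2.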

In addition, once the two problems above are solved, we can also
discuss whether exponential-delay should replace fixed-delay as the
default restart-strategy. In theory, exponential-delay is smarter and
friendlier than fixed-delay.

I would also like to thank Zhu Zhu for his suggestions on the option
name in FLINK-32895 [2].

I look forward to everyone's feedback and suggestions. Thank you.

[1] https://cwiki.apache.org/confluence/x/uJqzDw
[2] https://issues.apache.org/jira/browse/FLINK-32895

Best,
Rui
