Hi all,

I would like to start a discussion on FLIP-364: Improve the restart-strategy [1].

As we know, the restart-strategy is critical for Flink jobs. It mainly serves two purposes:

1. When an exception occurs in a Flink job, restart the job quickly so that it can return to the running state.
2. When a job cannot recover after frequent restarts within a certain period of time, stop retrying and fail the job.

The current restart-strategy support for purpose 2 has some issues:

1. The exponential-delay strategy has no max-attempts mechanism, which means Flink will restart the job indefinitely even if it fails frequently.
2. For multi-region streaming jobs and all batch jobs, the failure of each region increases the total number of job failures by 1, even if these failures occur at the same time. If the number of failures grows too quickly, it is difficult to set a reasonable number of retries: if the maximum is set too low, the job can easily reach the retry limit and fail; if it is set too high, some jobs will never fail.

In addition, once the above two problems are solved, we can also discuss whether exponential-delay can replace fixed-delay as the default restart-strategy. In theory, exponential-delay is smarter and friendlier than fixed-delay.

I would also like to thank Zhu Zhu in advance for his suggestions on the option name in FLINK-32895 [2].

I look forward to everyone's feedback and suggestions. Thank you.

[1] https://cwiki.apache.org/confluence/x/uJqzDw
[2] https://issues.apache.org/jira/browse/FLINK-32895

Best,
Rui
