Hi all,

We have run into a problem with FailureRateRestartStrategy that confuses us, and we don't know how to solve it. Here's the situation:
Flink version: 1.10.1
Deployment environment: on YARN
FailureRateRestartStrategy: failuresIntervalMS=60000, backoffTimeMS=15000, maxFailuresPerInterval=4

One of our Hadoop machines, which was running a TaskManager of our job, hung and stopped responding. At that moment the JobManager received a heartbeat timeout exception. After the exception was thrown 4 times in a very short time (about 10 ms each), the job hit the FailureRateRestartStrategy limit and the whole job quit. We got the message 'org.apache.flink.runtime.JobException: Recovery is suppressed by FailureRateRestartBackoffTimeStrategy'.

From what I understand of the documentation, the expected behavior is that the JobManager tries to restart the job, which would bring up a new TaskManager on another machine, but it did not. We also ran a test: we started a new job and simply killed the TaskManager, and in that case the job restarted as expected.

That is what confuses us most. If anyone knows what is happening, we would be grateful for any help. The JobManager and TaskManager logs are attached below.
Attachments:
JobManager.log
TaskManager.log
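
To make the symptom described above concrete, here is a minimal sketch of a sliding-window failure-rate check, using the settings from this post (maxFailuresPerInterval=4, failuresIntervalMS=60000). It is a simplified model written for illustration, not Flink's source code: the class FailureRateSketch and the method firstSuppressedFailure are hypothetical names, and the sketch ignores backoffTimeMS entirely. It assumes that recovery is suppressed as soon as the last maxFailuresPerInterval failures all fall within one interval.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical, simplified model of a failure-rate restart check.
// Not Flink's implementation; names and structure are assumptions.
public class FailureRateSketch {

    /**
     * Returns the 1-based index of the failure at which recovery would be
     * suppressed, or -1 if every failure in the input could be recovered.
     */
    public static int firstSuppressedFailure(
            long[] failureTimesMs, int maxFailuresPerInterval, long failuresIntervalMs) {
        // Keeps only the most recent maxFailuresPerInterval failure timestamps.
        Deque<Long> recent = new ArrayDeque<>(maxFailuresPerInterval);
        for (int i = 0; i < failureTimesMs.length; i++) {
            long now = failureTimesMs[i];
            if (recent.size() >= maxFailuresPerInterval) {
                recent.removeFirst(); // drop the oldest timestamp
            }
            recent.addLast(now);
            boolean windowFull = recent.size() >= maxFailuresPerInterval;
            // Suppress when the window is full and its oldest entry is still
            // within the configured interval.
            if (windowFull && now - recent.peekFirst() <= failuresIntervalMs) {
                return i + 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Four failures about 10 ms apart, as observed in this post.
        long[] burst = {0, 10, 20, 30};
        System.out.println(firstSuppressedFailure(burst, 4, 60_000)); // prints 4

        // Four failures spread over 90 s stay under the limit.
        long[] spaced = {0, 30_000, 60_000, 90_000};
        System.out.println(firstSuppressedFailure(spaced, 4, 60_000)); // prints -1
    }
}
```

Under these assumptions, four exceptions arriving within a few tens of milliseconds use up the whole failure budget before any restart can take place, which would be consistent with the 'Recovery is suppressed' message. Whether each of those exceptions is really counted as a separate failure in 1.10.1 is exactly the open question here.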