Flink failure rate restart does not work as expected

刘家锹 Mon, 28 Feb 2022 18:23:20 -0800

Hi all,
We have run into a problem with FailureRateRestartStrategy that confuses us, 
and we don't know how to solve it. Here is the situation:


Flink version: 1.10.1
Development env: on Yarn
FailureRateRestartStrategy: failuresIntervalMS=60000, backoffTimeMS=15000, maxFailuresPerInterval=4

One of our Hadoop machines, which was running our job's TaskManager, hung and 
stopped responding. At that point the JobManager received a heartbeat timeout 
exception. After the exception was thrown 4 times in a very short time (about 
10 ms apart), the job hit the FailureRateRestartStrategy limit and the whole 
job quit with the message 'org.apache.flink.runtime.JobException: Recovery is 
suppressed by FailureRateRestartBackoffTimeStrategy'.
From what I understand of the documentation, the expected behavior is that the 
JobManager tries to restart the job, which would bring up a new TaskManager on 
another machine, but it did not.
We also ran a test: we started a new job and simply killed the TaskManager, and 
in that case the job restarted as expected.
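
To make the question concrete, here is a minimal sketch of how a failure-rate 
check could behave with the settings above. It is a simplified model written 
for illustration, not Flink's actual source: the class name FailureRateModel 
and its methods are hypothetical, the backoff delay is not modeled, and it 
assumes that each of the four exceptions is counted as a separate failure.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified, hypothetical model of a failure-rate restart check.
// This is NOT Flink's implementation; it only illustrates the sliding-window idea.
class FailureRateModel {
    private final int maxFailuresPerInterval;
    private final long failuresIntervalMs;
    // Timestamps (ms) of the most recent failures, oldest first.
    private final Deque<Long> failureTimestamps = new ArrayDeque<>();

    FailureRateModel(int maxFailuresPerInterval, long failuresIntervalMs) {
        this.maxFailuresPerInterval = maxFailuresPerInterval;
        this.failuresIntervalMs = failuresIntervalMs;
    }

    // Record a failure, keeping at most maxFailuresPerInterval timestamps.
    void notifyFailure(long nowMs) {
        if (failureTimestamps.size() >= maxFailuresPerInterval) {
            failureTimestamps.removeFirst();
        }
        failureTimestamps.addLast(nowMs);
    }

    // A restart is allowed unless the last maxFailuresPerInterval failures
    // all happened within failuresIntervalMs of the current time.
    boolean canRestart(long nowMs) {
        if (failureTimestamps.size() < maxFailuresPerInterval) {
            return true;
        }
        return nowMs - failureTimestamps.peekFirst() > failuresIntervalMs;
    }

    public static void main(String[] args) {
        // Values from this report: maxFailuresPerInterval=4, failuresIntervalMS=60000.
        FailureRateModel model = new FailureRateModel(4, 60_000L);
        long t = 0L;
        for (int i = 1; i <= 4; i++) {
            model.notifyFailure(t);
            System.out.println("failure " + i + " at " + t + " ms, canRestart="
                    + model.canRestart(t));
            t += 10L; // failures arrive about 10 ms apart, as observed
        }
    }
}
```

Under these assumptions, four failures arriving about 10 ms apart fill the 
window almost immediately, so the fourth one is no longer allowed to restart. 
Whether Flink 1.10.1 actually counts the heartbeat timeout this way is exactly 
what we are unsure about.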

This is what confuses us most. If anyone knows what happened, we would be very grateful.

The JobManager and TaskManager logs are attached below.

Attachment: JobManager.log
Attachment: TaskManager.log
