Hi all,
We have run into a problem with FailureRateRestartStrategy that confuses us, 
and we don't know how to solve it. Here is the situation:

Flink version: 1.10.1
Development env: on Yarn
FailureRateRestartStrategy: 
failuresIntervalMS=60000,backoffTimeMS=15000,maxFailuresPerInterval=4

One of our Hadoop machines, on which our job's TaskManager was running, hung 
and stopped responding. At that moment the JobManager received a heartbeat 
timeout exception, but after the exception was thrown 4 times within a very 
short time (about 10 ms apart), the job hit the FailureRateRestartStrategy 
limit and the whole job quit. We got the message 
'org.apache.flink.runtime.JobException: Recovery is suppressed by 
FailureRateRestartBackoffTimeStrategy'.
From what I understand of the documentation, the expected behavior is that the 
JobManager tries to restart the job, which would bring up a new TaskManager on 
another machine, but it did not.
We also ran a test: we started a new job and simply killed the TaskManager, 
and in that case the job restarted as expected.
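
To make the question concrete, here is a minimal Python sketch of how we 
understand a failure-rate check to work. It is a simplified model, not Flink's 
actual code: the names FailureRateTracker and simulate are hypothetical, and 
the exact boundary (restart refused on the 4th failure inside the interval) is 
an assumption chosen to match what we observed.

```python
from collections import deque


class FailureRateTracker:
    """Toy model of a failure-rate restart check (NOT Flink's implementation)."""

    def __init__(self, failures_interval_ms, max_failures_per_interval):
        self.failures_interval_ms = failures_interval_ms
        self.max_failures = max_failures_per_interval
        # Keeps only the most recent `max_failures` failure timestamps.
        self.timestamps = deque(maxlen=max_failures_per_interval)

    def notify_failure(self, now_ms):
        # Record a failure; the oldest timestamp is dropped when the deque is full.
        self.timestamps.append(now_ms)

    def can_restart(self, now_ms):
        # Fewer than `max_failures` failures recorded: always allow a restart.
        if len(self.timestamps) < self.max_failures:
            return True
        # Otherwise allow a restart only if the oldest recorded failure
        # lies outside the measuring interval.
        return now_ms - self.timestamps[0] > self.failures_interval_ms


def simulate(failure_times_ms, failures_interval_ms=60000,
             max_failures_per_interval=4):
    """Return the restart decision taken right after each failure."""
    tracker = FailureRateTracker(failures_interval_ms, max_failures_per_interval)
    decisions = []
    for t in failure_times_ms:
        tracker.notify_failure(t)
        decisions.append(tracker.can_restart(t))
    return decisions


# Four failures about 10 ms apart, as in our incident.
burst = simulate([0, 10, 20, 30])
# Four failures spread over slightly more than one interval.
spaced = simulate([0, 20000, 40000, 61000])
```

In this model the burst case refuses the restart on the 4th failure, before 
the 15000 ms backoff can ever separate the attempts, while the spaced case 
keeps restarting.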

This is what confuses us most. If anyone knows what happened, we would be 
grateful for an explanation.

The JobManager and TaskManager logs are attached below.

Attachment: JobManager.log
Description: JobManager.log

Attachment: TaskManager.log
Description: TaskManager.log
