I realized I forgot to mention something above: the container exit code is 163, 
which is not a normal exit code; at least, I couldn't find its meaning on Google. 
My test didn't cover this situation, so I don't know whether it affects the 
results.

________________________________
From: 刘 家锹 <ljq1120799...@outlook.com>
Sent: Tuesday, March 1, 2022 10:23:50 PM
To: Matthias Pohl <matth...@ververica.com>; user <user@flink.apache.org>; 
David Morávek <d...@apache.org>
Subject: Re: Flink failure rate restart not work as expect

We didn't find any obvious configuration issues in our cluster. As far as I 
know, it works fine in most cases. I also simulated a failover under the current 
configuration by starting a new job with only one TaskManager and then killing 
the TaskManager container; the job recovered from the failure successfully.
As you said, the YARN logs suggest there may be a problem, so we will dig into 
them to see if we can find any hints.

________________________________
From: Matthias Pohl <matth...@ververica.com>
Sent: Tuesday, March 1, 2022 9:50:36 PM
To: 刘 家锹 <ljq1120799...@outlook.com>; user <user@flink.apache.org>; David 
Morávek <d...@apache.org>
Subject: Re: Flink failure rate restart not work as expect

The YARN node manager logs support my observation: The container exits with a 
failure which, if I understand it correctly, should cause a container restart 
on the YARN side. In HA mode, Flink expects the underlying resource management 
to restart the Flink cluster in case of failure. This does not seem to happen 
in your case. Is there a configuration issue in your YARN cluster? Or does the 
container recovery usually work in failure cases for you? I'm not that 
experienced with YARN deployments. I'm adding David to this thread. He might 
have some additional insights.

Matthias

On Tue, Mar 1, 2022 at 12:19 PM 刘 家锹 <ljq1120799...@outlook.com> wrote:
Unfortunately, we didn't keep the logs properly. The incident happened too long 
ago: the YARN ResourceManager log has been cleaned up and the broken machine has 
been reinstalled. We only found the JobManager's YARN log on the YARN 
NodeManager, which may be useless. We will post the detailed logs to this thread 
when it happens again; it happens occasionally, roughly every two weeks, when 
one of our cluster machines goes down.
________________________________
From: Matthias Pohl <matth...@ververica.com>
Sent: March 1, 2022 17:57
To: Alexander Preuß <alexanderpre...@ververica.com>
Cc: 刘 家锹 <ljq1120799...@outlook.com>; user@flink.apache.org 
<user@flink.apache.org>
Subject: Re: Flink failure rate restart not work as expect

Hi,
I second Alex' observation - based on the logs it looks like the task restart 
functionality worked as expected: It tried to restart the tasks until it 
reached the limit of 4 attempts due to the missing TaskManager. The job-cluster 
shut down with an error code. At this point, YARN should pick it up and bring 
up a new JobManager based on the non-0 exit code of the Flink cluster. It would 
be interesting to see the YARN logs to figure out why the cluster failover 
didn't work.

Best,
Matthias

On Tue, Mar 1, 2022 at 8:00 AM Alexander Preuß 
<alexanderpre...@ververica.com> wrote:
Hi,
At first glance, it looks like the exceptions were thrown in such rapid 
succession that they exceeded maxFailuresPerInterval, so the 
FailureRateRestartStrategy decided not to restart. Why do you think this is 
different from the expected behavior?
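
To illustrate the point about failures arriving in a burst, here is a minimal sketch of a failure-rate policy, using the settings from the original report (maxFailuresPerInterval=4, failuresIntervalMS=60000). It is a simplified model written for this thread, not Flink's source code; the class and function names (FailureRateModel, failures_until_suppressed) are hypothetical, and the backoff delay is deliberately left out.

```python
from collections import deque


class FailureRateModel:
    """Simplified model of a failure-rate restart policy (not Flink's actual code)."""

    def __init__(self, max_failures_per_interval, failures_interval_ms):
        self.max_failures = max_failures_per_interval
        self.interval_ms = failures_interval_ms
        self.timestamps = deque()  # times of the most recent failures, oldest first

    def notify_failure(self, now_ms):
        # Keep only the newest `max_failures` failure timestamps.
        if len(self.timestamps) >= self.max_failures:
            self.timestamps.popleft()
        self.timestamps.append(now_ms)

    def can_restart(self, now_ms):
        # Restart is refused once `max_failures` failures fall inside one interval.
        if len(self.timestamps) < self.max_failures:
            return True
        return now_ms - self.timestamps[0] > self.interval_ms


def failures_until_suppressed(failure_times_ms, max_failures=4, interval_ms=60_000):
    """Return the 1-based index of the failure that suppresses recovery, or None."""
    model = FailureRateModel(max_failures, interval_ms)
    for count, now_ms in enumerate(failure_times_ms, start=1):
        model.notify_failure(now_ms)
        if not model.can_restart(now_ms):
            return count
    return None
```

In this model, `failures_until_suppressed([0, 10, 20, 30])` returns 4: four failures about 10 ms apart all land in the same 60-second window, so the fourth one suppresses recovery. Failures spaced 30 seconds apart, such as `[0, 30_000, 60_000, 90_000, 120_000]`, never put four failures into one window, and the function returns `None`.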

Best,
Alex

On Tue, Mar 1, 2022 at 3:23 AM 刘 家锹 <ljq1120799...@outlook.com> wrote:
Hi all,
We have encountered a problem with FailureRateRestartStrategy that confuses us, 
and we don't know how to solve it. Here's the situation:

Flink version: 1.10.1
Development env: on Yarn
FailureRateRestartStrategy: 
failuresIntervalMS=60000,backoffTimeMS=15000,maxFailuresPerInterval=4

One of our Hadoop machines, on which our job's TaskManager was running, became 
unresponsive. The JobManager then received a heartbeat timeout exception, but 
after the exception was thrown 4 times in a very short time (about 10 ms 
apart), it hit the FailureRateRestartStrategy limit and the whole job quit. We 
got the message 'org.apache.flink.runtime.JobException: Recovery is 
suppressed by FailureRateRestartBackoffTimeStrategy'.
As far as I understand from the documentation, the expected behavior is that 
the JobManager tries to restart the job, which brings up a new TaskManager on 
another machine, but it did not.
We also ran a test: we started a new job and simply killed the TaskManager, and 
the job restarted as expected.

This is what confuses us most. If anyone knows what happened, we would be 
grateful.

The JobManager log and TaskManager log are attached below.


--

Alexander Preuß | Junior Engineer - Data Intensive Systems

alexanderpre...@ververica.com<mailto:alexanderpre...@ververica.com>
