I realized I forgot to mention something above: the container exit code is 163, which is not a normal exit code; at least, I couldn't find its meaning on Google. My test didn't cover this situation, so I don't know whether it affects the results.
________________________________
From: 刘 家锹 <ljq1120799...@outlook.com>
Sent: Tuesday, March 1, 2022 10:23:50 PM
To: Matthias Pohl <matth...@ververica.com>; user <user@flink.apache.org>; David Morávek <d...@apache.org>
Subject: Re: Flink failure rate restart not work as expect

We didn't find any obvious configuration issues in our cluster. As far as I know, it works fine in most cases. I also simulated a failover under the current configuration by starting a new job with only one TaskManager and then killing the TaskManager container; the job recovered from the failure successfully. As you said, the YARN logs suggest there may be a problem, so we will dig into them to see if we can find any hints.

________________________________
From: Matthias Pohl <matth...@ververica.com>
Sent: Tuesday, March 1, 2022 9:50:36 PM
To: 刘 家锹 <ljq1120799...@outlook.com>; user <user@flink.apache.org>; David Morávek <d...@apache.org>
Subject: Re: Flink failure rate restart not work as expect

The YARN NodeManager logs support my observation: the container exits with a failure which, if I understand it correctly, should cause a container restart on the YARN side. In HA mode, Flink expects the underlying resource manager to restart the Flink cluster in case of failure. This does not seem to happen in your case. Is there a configuration issue in your YARN cluster? Or does container recovery usually work for you in failure cases?

I'm not that experienced with YARN deployments, so I'm adding David to this thread. He might have some additional insights.

Matthias

On Tue, Mar 1, 2022 at 12:19 PM 刘 家锹 <ljq1120799...@outlook.com> wrote:

Unfortunately, we didn't keep the logs properly. This happened too long ago: the YARN ResourceManager log had already been cleaned up, and the broken machine had been reinstalled. We only found the YARN log of the JobManager on the YARN NodeManager, which may be of little use.
We will post the detailed logs to this thread when it happens again. It happens from time to time, roughly every two weeks, when one of our cluster machines goes down.

________________________________
From: Matthias Pohl <matth...@ververica.com>
Sent: March 1, 2022 17:57
To: Alexander Preuß <alexanderpre...@ververica.com>
Cc: 刘 家锹 <ljq1120799...@outlook.com>; user@flink.apache.org
Subject: Re: Flink failure rate restart not work as expect

Hi,
I second Alex's observation: based on the logs, it looks like the task restart functionality worked as expected. It tried to restart the tasks until it reached the limit of 4 attempts due to the missing TaskManager. The job cluster then shut down with an error code. At this point, YARN should pick it up and bring up a new JobManager based on the non-zero exit code of the Flink cluster. It would be interesting to see the YARN logs to figure out why the cluster failover didn't work.

Best,
Matthias

On Tue, Mar 1, 2022 at 8:00 AM Alexander Preuß <alexanderpre...@ververica.com> wrote:

Hi,
At first glance, it looks like the exceptions were thrown in such rapid succession that they exceeded maxFailuresPerInterval, so the FailureRateRestartStrategy decided not to restart. Why do you think this is different from the expected behavior?

Best,
Alex

On Tue, Mar 1, 2022 at 3:23 AM 刘 家锹 <ljq1120799...@outlook.com> wrote:

Hi all,
We have encountered a problem with FailureRateRestartStrategy that confuses us, and we don't know how to solve it. Here's the situation:

Flink version: 1.10.1
Environment: on YARN
FailureRateRestartStrategy: failuresIntervalMS=60000, backoffTimeMS=15000, maxFailuresPerInterval=4

One of our Hadoop machines, on which our job's TaskManager was running, became stuck and unresponsive.
At that moment, the JobManager received a heartbeat timeout exception, but after the exception had been thrown 4 times within a very short time (about 10 ms apart), it hit the FailureRateRestartStrategy limit and the whole job quit. We got the message 'org.apache.flink.runtime.JobException: Recovery is suppressed by FailureRateRestartBackoffTimeStrategy'. As far as I understand from the documentation, the expected behavior is that the JobManager should try to restart the job, which would bring up a new TaskManager on another machine, but it did not. We also ran a test: we started a new job and simply killed the TaskManager, and the job restarted as expected. This is what confuses us most. If anyone knows what happened, we would be grateful.

The JobManager log and TaskManager log are appended below.

--
Alexander Preuß | Junior Engineer - Data Intensive Systems
alexanderpre...@ververica.com
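
To make the behavior discussed in this thread concrete, here is a minimal sketch of a failure-rate restart policy using the values from the original message (maxFailuresPerInterval=4, failuresIntervalMS=60000). It is a simplified model written for illustration, not Flink's actual implementation: the class and function names are hypothetical, the 15 s backoff delay is not modeled, and the exact boundary (suppressing recovery on the 4th failure inside the window) is an assumption based on what the thread reports.

```python
from collections import deque


class FailureRateModel:
    """Simplified, hypothetical model of a failure-rate restart policy."""

    def __init__(self, max_failures_per_interval, failures_interval_ms):
        self.max_failures = max_failures_per_interval
        self.interval_ms = failures_interval_ms
        # Timestamps of the most recent failures, oldest first.
        # A full deque drops its oldest entry when a new one is appended.
        self.timestamps = deque(maxlen=max_failures_per_interval)

    def notify_failure(self, now_ms):
        self.timestamps.append(now_ms)

    def can_restart(self, now_ms):
        # Fewer failures recorded than the limit: restarting is allowed.
        if len(self.timestamps) < self.max_failures:
            return True
        # Window is full: restart only if the oldest failure has aged out.
        return now_ms - self.timestamps[0] > self.interval_ms


def simulate(failure_times_ms, max_failures=4, interval_ms=60_000):
    """Return the restart decision taken after each failure."""
    model = FailureRateModel(max_failures, interval_ms)
    decisions = []
    for t in failure_times_ms:
        model.notify_failure(t)
        decisions.append(model.can_restart(t))
    return decisions


# Four failures about 10 ms apart, as described in the original message.
burst = simulate([0, 10, 20, 30])
# Four failures spread over slightly more than one interval.
spread = simulate([0, 20_000, 40_000, 61_000])
print(burst)
print(spread)
```

Under this model, the burst of four failures fills the window almost instantly, so the last decision is to suppress recovery, whereas the spread-out failures never exhaust the budget. Once recovery is suppressed, the job cluster exits, and bringing up a new JobManager is left to YARN, as Matthias describes above.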