The YARN node manager logs support my observation: The container exits with
a failure which, if I understand it correctly, should cause a container
restart on the YARN side. In HA mode, Flink expects the underlying resource
management to restart the Flink cluster in case of failure. This does not
seem to happen in your case. Is there a configuration issue in your YARN
cluster? Or does the container recovery usually work in failure cases for
you? I'm not that experienced with YARN deployments. I'm adding David to
this thread. He might have some additional insights.

Matthias

On Tue, Mar 1, 2022 at 12:19 PM 刘 家锹 <ljq1120799...@outlook.com> wrote:

> Unfortunately we didn't keep the logs properly. This happened too long ago:
> the YARN ResourceManager log has been cleaned up and the broken machine has
> been reinstalled. We only found the YARN log of the JobManager on the YARN
> NodeManager, which may be useless. We will post the detailed logs to this
> thread when it happens again, since it happens from time to time, roughly
> every two weeks, when one of our cluster machines goes down.
> ------------------------------
> *From:* Matthias Pohl <matth...@ververica.com>
> *Sent:* March 1, 2022 17:57
> *To:* Alexander Preuß <alexanderpre...@ververica.com>
> *Cc:* 刘 家锹 <ljq1120799...@outlook.com>; user@flink.apache.org <
> user@flink.apache.org>
> *Subject:* Re: Flink failure rate restart not work as expect
>
> Hi,
> I second Alex' observation - based on the logs it looks like the task
> restart functionality worked as expected: It tried to restart the tasks
> until it reached the limit of 4 attempts due to the missing TaskManager.
> The job-cluster shut down with an error code. At this point, YARN should
> pick it up and bring up a new JobManager based on the non-0 exit code of
> the Flink cluster. It would be interesting to see the YARN logs to figure
> out why the cluster failover didn't work.
>
> Best,
> Matthias
>
> On Tue, Mar 1, 2022 at 8:00 AM Alexander Preuß <
> alexanderpre...@ververica.com> wrote:
>
> Hi,
> at first glance it looks like the exceptions were thrown in such rapid
> succession that they exceeded maxFailuresPerInterval, so the
> FailureRateRestartStrategy decided not to restart. Why do you think this is
> different from the expected behavior?
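
To make the failure-rate check discussed here concrete, below is a minimal sketch of a sliding-window failure counter using the values from this thread (failuresIntervalMS=60000, maxFailuresPerInterval=4). It is a simplified model, not Flink's actual implementation: the class and function names are hypothetical, the assumption that the restart is suppressed as soon as the number of failures inside the window reaches the limit is mine, and backoffTimeMS is not modeled.

```python
from collections import deque


class FailureRateSketch:
    """Simplified sliding-window model of a failure-rate restart policy.

    Hypothetical illustration only; this is not Flink's code.
    """

    def __init__(self, failures_interval_ms, max_failures_per_interval):
        self.failures_interval_ms = failures_interval_ms
        self.max_failures_per_interval = max_failures_per_interval
        self.failure_timestamps = deque()

    def notify_failure(self, now_ms):
        # Record the new failure, then drop failures that fell out of the window.
        self.failure_timestamps.append(now_ms)
        while now_ms - self.failure_timestamps[0] > self.failures_interval_ms:
            self.failure_timestamps.popleft()

    def can_restart(self):
        # Assumption: reaching the limit inside the window suppresses the restart.
        return len(self.failure_timestamps) < self.max_failures_per_interval


def simulate(failure_times_ms, failures_interval_ms=60_000,
             max_failures_per_interval=4):
    """Return the restart decision taken after each failure timestamp."""
    policy = FailureRateSketch(failures_interval_ms, max_failures_per_interval)
    decisions = []
    for t in failure_times_ms:
        policy.notify_failure(t)
        decisions.append(policy.can_restart())
    return decisions


# Four failures about 10 ms apart, as described in this thread.
burst = simulate([0, 10, 20, 30])
# Four failures spread out so that old ones leave the 60 s window.
spaced = simulate([0, 30_000, 61_000, 92_000])
print(burst)
print(spaced)
```

Under these assumptions, the burst of four failures within a few tens of milliseconds uses up the whole budget for the 60-second window, so the fourth failure is not followed by a restart, while the same number of failures spread over a longer period never reaches the limit.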
>
> Best,
> Alex
>
> On Tue, Mar 1, 2022 at 3:23 AM 刘 家锹 <ljq1120799...@outlook.com> wrote:
>
> Hi all,
> We have encountered a problem with FailureRateRestartStrategy that confuses
> us, and we don't know how to solve it. Here's the situation:
>
> Flink version: 1.10.1
> Development env: on Yarn
>
> FailureRateRestartStrategy: 
> failuresIntervalMS=60000,backoffTimeMS=15000,maxFailuresPerInterval=4
>
> One of our Hadoop machines, on which our job's TaskManager was running, got
> stuck and stopped responding. At that moment the JobManager received a
> heartbeat timeout exception, but after the exception was thrown 4 times in a
> very short time (about 10 ms each), it hit the FailureRateRestartStrategy
> limit and the whole job quit. We got the message
> 'org.apache.flink.runtime.JobException:
> Recovery is suppressed by FailureRateRestartBackoffTimeStrategy'.
> As far as I understand from the documentation, the expected behavior is that
> the JobManager tries to restart the job, which brings up a new TaskManager
> on another machine, but it did not.
> We also ran a test: we started a new job and just killed the TaskManager,
> and it restarted as expected.
>
> This is what confuses us most. If anyone knows what happened, we would be
> grateful.
>
> The JobManager and TaskManager logs are appended below.
>
