Hi, Jiaqiao:

Since your job has checkpointing enabled, you can simply remove the restart
strategy config. The default is then fixed-delay with Integer.MAX_VALUE
restart attempts and a '1 s' delay, as mentioned in [1]. This way, when a
failover occurs, your job waits 1 second before it restarts. Since the
maximum number of restart attempts is Integer.MAX_VALUE, the job will not
transition to FAILED unless a fatal error occurs.
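
To illustrate why the restart budget matters here, the following is a
minimal sketch in plain Java. It is a simplified model, not Flink's actual
implementation: the class RestartBudgetSketch and its methods are
hypothetical names, and the exact boundary at which Flink suppresses
recovery may differ. It compares a failure-rate budget of 4 failures per
60000 ms with a fixed-delay budget of Integer.MAX_VALUE attempts when 6
failures arrive about 10 ms apart (one per region, as described further
down in this thread).

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified model of two restart budgets. NOT Flink's implementation;
// all names here are hypothetical and exist only for illustration.
final class RestartBudgetSketch {

    // Failure-rate budget: restarts are allowed while at most
    // maxFailuresPerInterval failures fall inside a sliding window.
    static boolean failureRateSurvives(
            int failures, long gapMs, int maxFailuresPerInterval, long intervalMs) {
        Deque<Long> window = new ArrayDeque<>();
        for (int i = 0; i < failures; i++) {
            long now = i * gapMs;
            // Forget failures that are older than the interval.
            while (!window.isEmpty() && now - window.peekFirst() > intervalMs) {
                window.pollFirst();
            }
            window.addLast(now);
            if (window.size() > maxFailuresPerInterval) {
                return false; // budget exhausted, recovery is suppressed
            }
        }
        return true;
    }

    // Fixed-delay budget: restarts are allowed until maxAttempts is used up.
    static boolean fixedDelaySurvives(int failures, int maxAttempts) {
        return failures <= maxAttempts;
    }

    public static void main(String[] args) {
        // 6 regions failing about 10 ms apart, as in the reported incident.
        System.out.println(failureRateSurvives(6, 10, 4, 60_000));    // false
        // A single-region job produces only one failure.
        System.out.println(failureRateSurvives(1, 10, 4, 60_000));    // true
        // Default fixed-delay budget when checkpointing is enabled.
        System.out.println(fixedDelaySurvives(6, Integer.MAX_VALUE)); // true
    }
}
```

In this model, a burst of 6 failures exhausts a budget of 4 per interval,
while a single failure does not, and the fixed-delay default is effectively
never exhausted.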

Best,
Zhilong

[1]
https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#restart-strategy

On Wed, Mar 2, 2022 at 1:55 PM 刘 家锹 <ljq1120799...@outlook.com> wrote:

> Hi, all
>
> I think we may have found the reason: it is related to the '
> *jobmanager.execution.failover-strategy*' configuration and the number of
> regions in the job. In our case, we set failover-strategy to 'region' and
> this job has 6 regions running on only one TaskManager. So when the
> container goes down, every region needs to be restarted, because they all
> run on this single TaskManager.
> Clearly, 4 retries are not enough for 6 regions, so it is reasonable that
> this job quit.
> As for why my test job didn't quit: that job is different, it has only
> one region, so its behavior is also as expected.
>
> For us, we changed failover-strategy to 'full', since most of our jobs
> have only one TaskManager and a simple topology. That will help in most
> cases. Furthermore, since choosing the right parameters in combination
> with region failover is rather complex, we apply region failover to
> complex jobs only.
>
> If there are any best practices for pipelined-region failover restarts,
> or documentation about regions, that would be helpful.
>
> Again, thanks for taking the time to reply; it helped us a lot.
> ------------------------------
> *From:* 刘 家锹 <ljq1120799...@outlook.com>
> *Sent:* March 1, 2022 23:06
> *To:* Matthias Pohl <matth...@ververica.com>; user <user@flink.apache.org>;
> David Morávek <d...@apache.org>
> *Subject:* Re: Flink failure rate restart not work as expect
>
> I realized I forgot to mention something above: the container exit code is
> 163, which is not a normal code; at least I can't find its meaning on
> Google. My test didn't cover this situation, so I don't know whether it
> affects the results.
>
> ------------------------------
> *From:* 刘 家锹 <ljq1120799...@outlook.com>
> *Sent:* Tuesday, March 1, 2022 10:23:50 PM
> *To:* Matthias Pohl <matth...@ververica.com>; user <user@flink.apache.org>;
> David Morávek <d...@apache.org>
> *Subject:* Re: Flink failure rate restart not work as expect
>
> We didn't find any obvious configuration issues in our cluster. As far as
> I know, it works fine in most cases. I also simulated a failover under the
> current configuration by starting a new job with only one TaskManager and
> then killing the TaskManager container; the job recovered from the failure
> successfully.
> As you said, the YARN logs suggest there may be a problem; we will dig
> into them to see if we can find any hints.
>
> ------------------------------
> *From:* Matthias Pohl <matth...@ververica.com>
> *Sent:* Tuesday, March 1, 2022 9:50:36 PM
> *To:* 刘 家锹 <ljq1120799...@outlook.com>; user <user@flink.apache.org>;
> David Morávek <d...@apache.org>
> *Subject:* Re: Flink failure rate restart not work as expect
>
> The YARN node manager logs support my observation: The container exits
> with a failure which, if I understand it correctly, should cause a
> container restart on the YARN side. In HA mode, Flink expects the
> underlying resource management to restart the Flink cluster in case of
> failure. This does not seem to happen in your case. Is there a
> configuration issue in your YARN cluster? Or does the container recovery
> usually work in failure cases for you? I'm not that experienced with YARN
> deployments. I'm adding David to this thread. He might have some additional
> insights.
>
> Matthias
>
> On Tue, Mar 1, 2022 at 12:19 PM 刘 家锹 <ljq1120799...@outlook.com> wrote:
>
> Unfortunately we didn't keep the logs properly. This happened too long
> ago: the YARN ResourceManager log has been cleaned up and the broken
> machine has been reinstalled. We only found the YARN log of the JobManager
> on the YARN NodeManager, which may be useless. We will post detailed logs
> to this thread when it happens again; it happens from time to time,
> roughly every two weeks, when one of our cluster machines goes down.
> ------------------------------
> *From:* Matthias Pohl <matth...@ververica.com>
> *Sent:* March 1, 2022 17:57
> *To:* Alexander Preuß <alexanderpre...@ververica.com>
> *Cc:* 刘 家锹 <ljq1120799...@outlook.com>; user@flink.apache.org <
> user@flink.apache.org>
> *Subject:* Re: Flink failure rate restart not work as expect
>
> Hi,
> I second Alex' observation - based on the logs it looks like the task
> restart functionality worked as expected: It tried to restart the tasks
> until it reached the limit of 4 attempts due to the missing TaskManager.
> The job-cluster shut down with an error code. At this point, YARN should
> pick it up and bring up a new JobManager based on the non-0 exit code of
> the Flink cluster. It would be interesting to see the YARN logs to figure
> out why the cluster failover didn't work.
>
> Best,
> Matthias
>
> On Tue, Mar 1, 2022 at 8:00 AM Alexander Preuß <
> alexanderpre...@ververica.com> wrote:
>
> Hi,
> At first glance it looks like the exceptions were thrown in such rapid
> succession that they exceeded maxFailuresPerInterval, and the
> FailureRateRestartStrategy decided not to restart. Why do you think this
> is different from the expected behavior?
>
> Best,
> Alex
>
> On Tue, Mar 1, 2022 at 3:23 AM 刘 家锹 <ljq1120799...@outlook.com> wrote:
>
> Hi, all
> We encountered a problem with FailureRateRestartStrategy that confuses
> us, and we don't know how to solve it. Here's the situation:
>
> Flink version: 1.10.1
> Development env: on Yarn
>
> FailureRateRestartStrategy: 
> failuresIntervalMS=60000,backoffTimeMS=15000,maxFailuresPerInterval=4
>
> One of our Hadoop machines, on which our job's TaskManager was running,
> got stuck and stopped responding. At that moment, the JobManager received
> a heartbeat timeout exception, but after the exception was thrown 4 times
> in a very short time (about 10 ms apart), it hit the
> FailureRateRestartStrategy limit and the whole job quit. We got the
> message 'org.apache.flink.runtime.JobException: Recovery is suppressed by
> FailureRateRestartBackoffTimeStrategy'.
> As far as I know from the documentation, the expected behavior is that the
> JobManager tries to restart the job, which brings up a new TaskManager on
> another machine, but it did not.
> We also ran a test: we started a new job and just killed the TaskManager,
> and it restarted as expected.
>
> This confuses us a lot. If anyone knows what happened, we would be
> grateful.
>
> The JobManager and TaskManager logs are appended below.
