Hi Jiaqiao,

Since your job has checkpointing enabled, you can simply remove the restart strategy config. The default then becomes fixed-delay with Integer.MAX_VALUE restart attempts and a '1 s' delay, as mentioned in [1]. With that setting, when a failover occurs the job waits 1 second before it restarts. Because the maximum number of restart attempts is Integer.MAX_VALUE, the job will not transition to FAILED unless a fatal error occurs.
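
To make the difference concrete, here is a minimal, self-contained Java sketch. It is a simplified model of the two strategies, not Flink's actual implementation: the class and method names (RestartStrategySketch, firstSuppressedFailure, firstSuppressedFailureFixedDelay) are made up for illustration, and the failure-rate rule is an assumption, namely that a restart is refused once maxFailuresPerInterval failures fall within failuresIntervalMS. The sketch feeds both models six failures that arrive about 10 ms apart, which is roughly what happens when six regions on one TaskManager fail together.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Simplified model of two restart strategies; NOT Flink's actual classes. */
final class RestartStrategySketch {

    /**
     * Failure-rate model (assumption): a restart is refused once
     * maxFailuresPerInterval failures lie within failuresIntervalMs.
     * Returns the 1-based index of the first failure that is not
     * restarted, or -1 if every failure leads to a restart.
     */
    static int firstSuppressedFailure(long[] failureTimesMs,
                                      int maxFailuresPerInterval,
                                      long failuresIntervalMs) {
        Deque<Long> recent = new ArrayDeque<>();
        for (int i = 0; i < failureTimesMs.length; i++) {
            long now = failureTimesMs[i];
            if (recent.size() >= maxFailuresPerInterval) {
                recent.removeFirst(); // keep only the newest timestamps
            }
            recent.addLast(now);
            boolean windowFull = recent.size() >= maxFailuresPerInterval;
            if (windowFull && now - recent.peekFirst() <= failuresIntervalMs) {
                return i + 1; // too many failures inside the interval
            }
        }
        return -1;
    }

    /**
     * Fixed-delay model: each failure consumes one restart attempt, and the
     * failure after the last attempt is not restarted. Returns the 1-based
     * index of that failure, or -1 if the attempts are never exhausted.
     */
    static int firstSuppressedFailureFixedDelay(int failureCount, int maxAttempts) {
        return failureCount > maxAttempts ? maxAttempts + 1 : -1;
    }

    public static void main(String[] args) {
        // Six region failures about 10 ms apart (hypothetical timestamps).
        long[] failures = {0, 10, 20, 30, 40, 50};

        // Settings from this thread: maxFailuresPerInterval=4, failuresIntervalMS=60000.
        System.out.println(firstSuppressedFailure(failures, 4, 60_000L)); // prints 4

        // Default with checkpointing: fixed-delay, Integer.MAX_VALUE attempts.
        System.out.println(
                firstSuppressedFailureFixedDelay(failures.length, Integer.MAX_VALUE)); // prints -1
    }
}
```

In this model the failure-rate settings give up at the fourth failure, while the fixed-delay default keeps restarting. The same six failures spread 30 seconds apart would not trip the failure-rate limit either, so the burst is what matters, not the total count.
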

Best,
Zhilong

[1] https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#restart-strategy

On Wed, Mar 2, 2022 at 1:55 PM 刘 家锹 <ljq1120799...@outlook.com> wrote:

> Hi all,
>
> I think we may have found the reason. It is related to the
> '*jobmanager.execution.failover-strategy*' configuration and the number
> of regions in the job. In our case, failover-strategy is set to 'region',
> and this job has 6 regions running on a single TaskManager. So when the
> container goes down, every region needs to be restarted, because they all
> belong to that one TaskManager.
> Clearly 4 retries are not enough for 6 regions, so it is reasonable that
> the job quit.
> As for why my test job didn't quit: that job is different. It has only
> one region, so its behavior is also as expected.
>
> For our part, we changed failover-strategy to 'full', since most of our
> jobs have only one TaskManager and a simple topology. That will help in
> most cases. Furthermore, with region failover it is rather complex to
> configure the right restart parameters, so we will use region failover
> for complex jobs only.
>
> If there are any best practices for restarts with pipelined-region
> failover, or documentation about regions, that would be helpful.
>
> Again, thanks for taking the time to reply; it helped us a lot.
> ------------------------------
> *From:* 刘 家锹 <ljq1120799...@outlook.com>
> *Sent:* March 1, 2022 23:06
> *To:* Matthias Pohl <matth...@ververica.com>; user <user@flink.apache.org>;
> David Morávek <d...@apache.org>
> *Subject:* Re: Flink failure rate restart not work as expect
>
> I realized I forgot to mention something above: the container exit code is
> 163, which is not a normal code; at least I can't find any meaning for it
> on Google. My test didn't cover this situation, so I don't know whether it
> affects the results.
>
> ------------------------------
> *From:* 刘 家锹 <ljq1120799...@outlook.com>
> *Sent:* Tuesday, March 1, 2022 10:23:50 PM
> *To:* Matthias Pohl <matth...@ververica.com>; user <user@flink.apache.org>;
> David Morávek <d...@apache.org>
> *Subject:* Re: Flink failure rate restart not work as expect
>
> We didn't find any obvious configuration issues in our cluster. As far as
> I know, it works fine in most cases. I also simulated a failover under the
> current configuration by starting a new job with only one TaskManager and
> then killing the TaskManager container; that job recovered from the
> failure successfully.
> As you said, the YARN logs suggest there may be a problem, so we will dig
> into them to see if we can find any hints.
>
> ------------------------------
> *From:* Matthias Pohl <matth...@ververica.com>
> *Sent:* Tuesday, March 1, 2022 9:50:36 PM
> *To:* 刘 家锹 <ljq1120799...@outlook.com>; user <user@flink.apache.org>;
> David Morávek <d...@apache.org>
> *Subject:* Re: Flink failure rate restart not work as expect
>
> The YARN node manager logs support my observation: The container exits
> with a failure which, if I understand it correctly, should cause a
> container restart on the YARN side. In HA mode, Flink expects the
> underlying resource management to restart the Flink cluster in case of
> failure. This does not seem to happen in your case. Is there a
> configuration issue in your YARN cluster? Or does the container recovery
> usually work in failure cases for you? I'm not that experienced with YARN
> deployments. I'm adding David to this thread. He might have some additional
> insights.
>
> Matthias
>
> On Tue, Mar 1, 2022 at 12:19 PM 刘 家锹 <ljq1120799...@outlook.com> wrote:
>
> Unfortunately we didn't keep the logs properly. This happened too long
> ago: the YARN ResourceManager log has been cleaned up, and the broken
> machine has been reinstalled.
> We only found the YARN log of the JobManager on the YARN NodeManager,
> which may be useless. We will post the detailed logs to this thread when
> it happens again; it happens occasionally, about once every two weeks,
> when one of our cluster machines goes down.
> ------------------------------
> *From:* Matthias Pohl <matth...@ververica.com>
> *Sent:* March 1, 2022 17:57
> *To:* Alexander Preuß <alexanderpre...@ververica.com>
> *Cc:* 刘 家锹 <ljq1120799...@outlook.com>; user@flink.apache.org <
> user@flink.apache.org>
> *Subject:* Re: Flink failure rate restart not work as expect
>
> Hi,
> I second Alex's observation - based on the logs it looks like the task
> restart functionality worked as expected: It tried to restart the tasks
> until it reached the limit of 4 attempts due to the missing TaskManager.
> The job-cluster shut down with an error code. At this point, YARN should
> pick it up and bring up a new JobManager based on the non-0 exit code of
> the Flink cluster. It would be interesting to see the YARN logs to figure
> out why the cluster failover didn't work.
>
> Best,
> Matthias
>
> On Tue, Mar 1, 2022 at 8:00 AM Alexander Preuß <
> alexanderpre...@ververica.com> wrote:
>
> Hi,
> at first glance it looks like the exception was thrown very rapidly, so
> it exceeded maxFailuresPerInterval and the FailureRateRestartStrategy
> decided not to restart. Why do you think this is different from the
> expected behavior?
>
> Best,
> Alex
>
> On Tue, Mar 1, 2022 at 3:23 AM 刘 家锹 <ljq1120799...@outlook.com> wrote:
>
> Hi all,
> We have run into a problem with FailureRateRestartStrategy that confuses
> us, and we don't know how to solve it. Here is the situation:
>
> Flink version: 1.10.1
> Development env: on Yarn
>
> FailureRateRestartStrategy:
> failuresIntervalMS=60000,backoffTimeMS=15000,maxFailuresPerInterval=4
>
> One of our Hadoop machines, on which our job's TaskManager was running,
> became unresponsive.
> At that point the JobManager received a heartbeat timeout exception, but
> after the exception was thrown 4 times in a very short time (about 10 ms
> apart), it hit the FailureRateRestartStrategy limit and the whole job
> quit. We got the message 'org.apache.flink.runtime.JobException:
> Recovery is suppressed by FailureRateRestartBackoffTimeStrategy'.
> From what I understand of the documentation, the expected behavior is
> that the JobManager tries to restart the job, which brings up a new
> TaskManager on another machine, but it did not.
> We also ran a test: we started a new job and simply killed the
> TaskManager, and that job restarted as expected.
>
> This is what confuses us most. If anyone knows what happened, we would be
> grateful.
>
> The JobManager and TaskManager logs are appended below.
>
> --
> Alexander Preuß | Junior Engineer - Data Intensive Systems
> alexanderpre...@ververica.com