Re: Flink failure rate restart not work as expect

2022-03-02 Thread Zhilong Hong
Hi, Jiaqiao:

Since checkpointing is enabled for your job, you can simply remove the
restart strategy configuration. The default then becomes fixed-delay with
Integer.MAX_VALUE restart attempts and a '1 s' delay, as mentioned in [1].
This way, when a failover occurs, your job waits 1 second before it
restarts. Since the maximum number of restart attempts is Integer.MAX_VALUE,
the job will not transition to FAILED unless a fatal error occurs.

Best,
Zhilong

[1]
https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#restart-strategy
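
To make the default behaviour described above concrete, here is a minimal
Python sketch of a fixed-delay restart policy. It is only a simplified model,
not Flink's implementation: the class `FixedDelayModel` and its method
`notify_failure` are hypothetical names, and the only values taken from the
message above are the Integer.MAX_VALUE attempt limit and the 1 s delay.

```python
INT_MAX = 2**31 - 1  # value of Java's Integer.MAX_VALUE


class FixedDelayModel:
    """Hypothetical, simplified model of a fixed-delay restart policy."""

    def __init__(self, max_attempts=INT_MAX, delay_ms=1000):
        self.max_attempts = max_attempts
        self.delay_ms = delay_ms
        self.attempts = 0

    def notify_failure(self):
        """Count one failure; return the restart delay in ms, or None when
        the attempt limit is exhausted (the job would then fail)."""
        self.attempts += 1
        if self.attempts > self.max_attempts:
            return None
        return self.delay_ms


# Six failures in a row (for example, six regions lost together) still all
# lead to a restart after the fixed delay.
default_policy = FixedDelayModel()
delays = [default_policy.notify_failure() for _ in range(6)]
print(delays)  # [1000, 1000, 1000, 1000, 1000, 1000]
```

With a small attempt limit the same model runs out of restarts quickly, which
is the situation the Integer.MAX_VALUE default avoids.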


Re: Flink failure rate restart not work as expect

2022-03-01 Thread 刘 家锹
Hi, all

I think we have found the reason. It is related to the 
'jobmanager.execution.failover-strategy' configuration and the number of 
regions in the job. In our case, failover-strategy is set to 'region' and this 
job has 6 regions, all running on a single TaskManager. So when the container 
goes down, every region needs to be restarted, because they all belong to that 
one TaskManager.
It is easy to see that 4 retries are not enough for 6 regions, so it is 
reasonable that this job quit.
As for why my test job did not quit: that job is different in that it has only 
one region, so its behavior is also as expected.
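
The counting argument above can be sketched with a small Python model. This is 
a simplified sliding-window model, not Flink's code: the names 
`FailureRateModel` and `first_suppressed_failure` are hypothetical, the 
60,000 ms interval is an assumed value chosen only for illustration, and the 
rule that recovery is suppressed once the number of failures inside the 
interval reaches the limit is an assumption based on what we observed (the job 
quit after 4 exceptions).

```python
from collections import deque


class FailureRateModel:
    """Hypothetical sliding-window model of a failure-rate restart policy."""

    def __init__(self, max_failures_per_interval, failures_interval_ms):
        self.max_failures = max_failures_per_interval
        self.interval_ms = failures_interval_ms
        self.timestamps = deque()

    def notify_failure(self, now_ms):
        """Record one failure; return True if a restart is still allowed."""
        self.timestamps.append(now_ms)
        # Forget failures that are older than the measuring interval.
        while now_ms - self.timestamps[0] > self.interval_ms:
            self.timestamps.popleft()
        # Assumption: reaching the limit inside the interval suppresses recovery.
        return len(self.timestamps) < self.max_failures


def first_suppressed_failure(num_regions, spacing_ms, model):
    """Report one failure per region, spacing_ms apart. Return the 1-based
    index of the failure that suppresses recovery, or None if none does."""
    for i in range(num_regions):
        if not model.notify_failure(i * spacing_ms):
            return i + 1
    return None


# Six regions on one lost TaskManager, failures about 10 ms apart, limit of 4.
six_regions = first_suppressed_failure(6, 10, FailureRateModel(4, 60_000))
# A job with a single region reports only one failure.
one_region = first_suppressed_failure(1, 10, FailureRateModel(4, 60_000))
print(six_regions, one_region)  # 4 None
```

In this model the job with 6 regions is stopped at the fourth failure, while 
the single-region test job never comes close to the limit.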

For now, we have changed failover-strategy to 'full', since most of our jobs 
have only one TaskManager and a simple topology. That will help in most cases. 
Furthermore, since it is fairly complex to choose the right restart parameters 
in combination with region failover, we apply region failover to complex jobs 
only.

If there are any best practices for restarts with pipelined-region failover, 
or any documentation about regions, that would be helpful.

Again, thanks for taking the time to reply. It helped us a lot.


Re: Flink failure rate restart not work as expect

2022-03-01 Thread 刘 家锹
I realized I forgot to mention something above: the container exit code is 163, 
which is not a normal exit code. At least I can't find its meaning on Google. 
My test did not cover this situation, so I don't know whether it affects the 
results.


Re: Flink failure rate restart not work as expect

2022-03-01 Thread 刘 家锹
We didn't find any obvious configuration issues in our cluster. As far as I 
know, it works fine in most cases. I also simulated a failover under the 
current configuration by starting a new job with only one TaskManager and then 
killing the TaskManager container; the job recovered from the failure 
successfully.
As you said, the YARN logs suggest there may be a problem. We will dig into 
them to see if we can find any hints.




Re: Flink failure rate restart not work as expect

2022-03-01 Thread Matthias Pohl
The YARN node manager logs support my observation: The container exits with
a failure which, if I understand it correctly, should cause a container
restart on the YARN side. In HA mode, Flink expects the underlying resource
management to restart the Flink cluster in case of failure. This does not
seem to happen in your case. Is there a configuration issue in your YARN
cluster? Or does the container recovery usually work in failure cases for
you? I'm not that experienced with YARN deployments. I'm adding David to
this thread. He might have some additional insights.

Matthias



Re: Flink failure rate restart not work as expect

2022-03-01 Thread Matthias Pohl
Hi,
I second Alex' observation - based on the logs it looks like the task
restart functionality worked as expected: It tried to restart the tasks
until it reached the limit of 4 attempts due to the missing TaskManager.
The job-cluster shut down with an error code. At this point, YARN should
pick it up and bring up a new JobManager based on the non-0 exit code of
the Flink cluster. It would be interesting to see the YARN logs to figure
out why the cluster failover didn't work.

Best,
Matthias



Re: Flink failure rate restart not work as expect

2022-02-28 Thread Alexander Preuß
Hi,
at first glance it looks like the exceptions were thrown in very quick
succession, so the job exceeded maxFailuresPerInterval and the failure-rate
restart strategy decided not to restart. Why do you think this is different
from the expected behavior?

Best,
Alex


-- 

Alexander Preuß | Junior Engineer - Data Intensive Systems

alexanderpre...@ververica.com




Follow us @VervericaData

--

Join Flink Forward - The Apache Flink Conference

Stream Processing | Event Driven | Real Time

--

Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany

--

Ververica GmbH

Registered at Amtsgericht Charlottenburg: HRB 158244 B

Managing Directors: Karl Anton Wehner, Holger Temme, Yip Park Tung Jason,
Jinwei (Kevin) Zhang


Flink failure rate restart not work as expect

2022-02-28 Thread 刘 家锹
Hi, all
We have run into a problem with FailureRateRestartStrategy that confuses us, 
and we don't know how to solve it. Here is the situation:

Flink version: 1.10.1
Development env: on Yarn
FailureRateRestartStrategy: 
failuresIntervalMS=6,backoffTimeMS=15000,maxFailuresPerInterval=4

One of our Hadoop machines, on which our job's TaskManager was running, became 
unresponsive. At that moment the JobManager received a heartbeat timeout 
exception, but after the exception was thrown 4 times in a very short time 
(about 10 ms each), the job hit the FailureRateRestartStrategy limit and the 
whole job quit. We got the message 'org.apache.flink.runtime.JobException: 
Recovery is suppressed by FailureRateRestartBackoffTimeStrategy'.
From what I understand from the documentation, the expected behavior is that 
the JobManager tries to restart the job, which brings up a new TaskManager on 
another machine, but it did not.
We also ran a test: we started a new job and simply killed the TaskManager, and 
that job restarted as expected.

This is what confuses us most. If anyone knows what happened, we would be 
grateful.

The JobManager log and TaskManager log are attached below.


JobManager.log
Description: JobManager.log


TaskManager.log
Description: TaskManager.log