Re: failure-rate restart strategy not working?

2017-01-09 Thread Aljoscha Krettek
Hi,
did you create a Jira issue for this? (I'm just getting up to speed after
vacation, so sorry if you already did; I haven't read the Jira mail yet.)

Cheers,
Aljoscha

On Fri, 6 Jan 2017 at 19:08 Stephan Ewen <se...@apache.org> wrote:

> I think you are right, enabling checkpointing should not override the
> cluster settings per se.
>
> This is probably an unwanted artifact of the way that configuration
> currently works: Settings explicitly set in the program trump the
> cluster defaults (in the config). Since activating checkpointing sets a
> strategy in the ExecutionConfig (program), it overrides the cluster default.
>
> It is definitely not intended in that case. For that specific case, it
> makes sense to simply leave the restart strategy "undefined" and use the
> "fixed delay" one at runtime if none other is specified.
>
> Stephan
>
>
>
>
> On Fri, Jan 6, 2017 at 6:44 PM, Shannon Carey <sca...@expedia.com> wrote:
>
> I think I figured it out: the problem is due to Flink's behavior when a
> job has checkpointing enabled.
>
> When the job graph is created, if checkpointing is enabled but a restart
> strategy hasn't been programmatically configured, Flink changes the job
> graph's execution config to use the fixed delay restart strategy. That gets
> serialized with the job graph. Then, when the JobManager deserializes the
> execution config, it sees that there's a restart strategy configured for
> the job and uses that instead of using the restart strategy that's
> configured on the cluster.
>
> Clearly, the documentation needs to be adjusted. Maybe I can
> add some changes to https://github.com/apache/flink/pull/3059
>
> However, should we also consider some implementation changes? Is it
> intentional that enabling checkpointing overrides the restart strategy set
> on the cluster, and that the only way to control the restart strategy on a
> checkpointed job is to set it programmatically? If not, then would it be
> reasonable to only set the fixed-delay restart strategy if checkpointing is
> enabled AND the cluster doesn't explicitly configure one? Flink would no
> longer use the execution config to control the strategy, but would
> instead do it in JobManager#submitJob().
>
> -Shannon
>
> From: Shannon Carey <sca...@expedia.com>
> Date: Thursday, January 5, 2017 at 1:50 PM
> To: "user@flink.apache.org" <user@flink.apache.org>
> Subject: failure-rate restart strategy not working?
>
> I recently updated my cluster with the following config:
>
> restart-strategy: failure-rate
> restart-strategy.failure-rate.max-failures-per-interval: 3
> restart-strategy.failure-rate.failure-rate-interval: 5 min
> restart-strategy.failure-rate.delay: 10 s
>
> I see the settings inside the JobManager web UI, as expected. I am not
> setting the restart-strategy programmatically, but the job does have
> checkpointing enabled.
>
> However, if I launch a job that (intentionally) fails every 10 seconds by
> throwing a RuntimeException, it continues to restart beyond the limit of 3
> failures.
>
> Does anyone know why this might be happening? Any ideas of things I could
> check?
>
> Thanks!
> Shannon
>
>
>


Re: failure-rate restart strategy not working?

2017-01-06 Thread Shannon Carey
I think I figured it out: the problem is due to Flink's behavior when a job has 
checkpointing enabled.

When the job graph is created, if checkpointing is enabled but a restart 
strategy hasn't been programmatically configured, Flink changes the job graph's 
execution config to use the fixed delay restart strategy. That gets serialized 
with the job graph. Then, when the JobManager deserializes the execution 
config, it sees that there's a restart strategy configured for the job and uses 
that instead of using the restart strategy that's configured on the cluster.
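To make the precedence concrete, here is a minimal sketch of the behavior described above. It is a plain-Java model, not Flink code: the class and method names are hypothetical and the strategies are represented as simple strings.

```java
import java.util.Optional;

// Hypothetical model of the precedence described above; these are not Flink classes.
class RestartStrategyPrecedence {

    /** Client side: the strategy that ends up in the serialized execution config. */
    static Optional<String> jobGraphStrategy(Optional<String> setInProgram,
                                             boolean checkpointingEnabled) {
        if (setInProgram.isPresent()) {
            return setInProgram;
        }
        // With checkpointing enabled and nothing set in the program,
        // fixed-delay is filled in implicitly.
        return checkpointingEnabled ? Optional.of("fixed-delay") : Optional.empty();
    }

    /** JobManager side: a strategy found in the job's config wins over the cluster's. */
    static String effectiveStrategy(Optional<String> fromJobGraph, String clusterStrategy) {
        return fromJobGraph.orElse(clusterStrategy);
    }

    public static void main(String[] args) {
        String cluster = "failure-rate";
        // Checkpointing on, nothing set in the program: the cluster setting is ignored.
        System.out.println(effectiveStrategy(jobGraphStrategy(Optional.empty(), true), cluster));
        // Checkpointing off: the cluster setting is used.
        System.out.println(effectiveStrategy(jobGraphStrategy(Optional.empty(), false), cluster));
    }
}
```

In this model the first call yields fixed-delay even though the cluster is configured with failure-rate, which matches what I observed.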

Clearly, the documentation needs to be adjusted. Maybe I can add 
some changes to https://github.com/apache/flink/pull/3059

However, should we also consider some implementation changes? Is it intentional
that enabling checkpointing overrides the restart strategy set on the cluster,
and that the only way to control the restart strategy on a checkpointed job is
to set it programmatically? If not, then would it be reasonable to only set the
fixed-delay restart strategy if checkpointing is enabled AND the cluster
doesn't explicitly configure one? Flink would no longer use the execution
config to control the strategy, but would instead do it in
JobManager#submitJob().
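A rough sketch of the resolution order I have in mind, again as a plain-Java model rather than actual Flink code. The class and method names are hypothetical, and falling back to "none" (no restart) when nothing is configured and checkpointing is off is my assumption.

```java
import java.util.Optional;

// Hypothetical sketch of the proposed resolution at job submission; not Flink code.
class ProposedRestartStrategyResolution {

    static String resolve(Optional<String> setInProgram,
                          Optional<String> setOnCluster,
                          boolean checkpointingEnabled) {
        // 1. A strategy set explicitly in the program still wins.
        if (setInProgram.isPresent()) {
            return setInProgram.get();
        }
        // 2. Otherwise an explicit cluster configuration is respected.
        if (setOnCluster.isPresent()) {
            return setOnCluster.get();
        }
        // 3. Only when neither is set does checkpointing imply fixed-delay.
        //    "none" as the final fallback is an assumption of this sketch.
        return checkpointingEnabled ? "fixed-delay" : "none";
    }

    public static void main(String[] args) {
        // Checkpointed job on a cluster configured with failure-rate.
        System.out.println(resolve(Optional.empty(), Optional.of("failure-rate"), true));
        // Checkpointed job on a cluster without an explicit strategy.
        System.out.println(resolve(Optional.empty(), Optional.empty(), true));
    }
}
```

With this order, a checkpointed job on my cluster would resolve to failure-rate, and fixed-delay would only apply where nothing else is configured.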

-Shannon

From: Shannon Carey <sca...@expedia.com>
Date: Thursday, January 5, 2017 at 1:50 PM
To: "user@flink.apache.org" <user@flink.apache.org>
Subject: failure-rate restart strategy not working?

I recently updated my cluster with the following config:

restart-strategy: failure-rate
restart-strategy.failure-rate.max-failures-per-interval: 3
restart-strategy.failure-rate.failure-rate-interval: 5 min
restart-strategy.failure-rate.delay: 10 s

I see the settings inside the JobManager web UI, as expected. I am not setting 
the restart-strategy programmatically, but the job does have checkpointing 
enabled.

However, if I launch a job that (intentionally) fails every 10 seconds by 
throwing a RuntimeException, it continues to restart beyond the limit of 3 
failures.

Does anyone know why this might be happening? Any ideas of things I could check?

Thanks!
Shannon


failure-rate restart strategy not working?

2017-01-05 Thread Shannon Carey
I recently updated my cluster with the following config:

restart-strategy: failure-rate
restart-strategy.failure-rate.max-failures-per-interval: 3
restart-strategy.failure-rate.failure-rate-interval: 5 min
restart-strategy.failure-rate.delay: 10 s

I see the settings inside the JobManager web UI, as expected. I am not setting 
the restart-strategy programmatically, but the job does have checkpointing 
enabled.

However, if I launch a job that (intentionally) fails every 10 seconds by 
throwing a RuntimeException, it continues to restart beyond the limit of 3 
failures.
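For reference, the behavior I expected from the settings above can be approximated with a small sliding-window model. This is only a simplified sketch, not Flink's actual implementation: the class and method names are hypothetical, and the 20-second failure cycle (10 s restart delay plus roughly 10 s until the next exception) is an assumption.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified, hypothetical model of a failure-rate limit; not Flink's implementation.
class FailureRateModel {
    private final int maxFailuresPerInterval;
    private final long intervalMillis;
    private final Deque<Long> recentFailures = new ArrayDeque<>();

    FailureRateModel(int maxFailuresPerInterval, long intervalMillis) {
        this.maxFailuresPerInterval = maxFailuresPerInterval;
        this.intervalMillis = intervalMillis;
    }

    /** Records a failure at nowMillis and reports whether a restart is still allowed. */
    boolean onFailure(long nowMillis) {
        // Drop failures that have fallen out of the sliding window.
        while (!recentFailures.isEmpty()
                && nowMillis - recentFailures.peekFirst() >= intervalMillis) {
            recentFailures.pollFirst();
        }
        recentFailures.addLast(nowMillis);
        return recentFailures.size() <= maxFailuresPerInterval;
    }

    /** Simulates a job failing every periodMillis; returns the restarts before giving up. */
    static int restartsBeforeGivingUp(int maxFailures, long intervalMillis,
                                      long periodMillis, int failuresToSimulate) {
        FailureRateModel model = new FailureRateModel(maxFailures, intervalMillis);
        int restarts = 0;
        for (int i = 0; i < failuresToSimulate; i++) {
            if (!model.onFailure(i * periodMillis)) {
                return restarts;
            }
            restarts++;
        }
        return restarts;
    }

    public static void main(String[] args) {
        // 3 failures per 5 minutes, one failure every 20 seconds (assumed cycle).
        System.out.println(restartsBeforeGivingUp(3, 5 * 60_000L, 20_000L, 100));
    }
}
```

Under this model the job is restarted three times and then fails permanently on the fourth failure, which is not what I see on the cluster.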

Does anyone know why this might be happening? Any ideas of things I could check?

Thanks!
Shannon