[jira] [Updated] (FLINK-10475) Standalone HA - Leader election is not triggered on loss of leader (ZK 3.5.3-beta only)

2019-02-28 Robert Metzger (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Metzger updated FLINK-10475:
---
Component/s: Runtime / Coordination

> Standalone HA - Leader election is not triggered on loss of leader (ZK 
> 3.5.3-beta only)
> ---
>
> Key: FLINK-10475
> URL: https://issues.apache.org/jira/browse/FLINK-10475
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.5.4, 1.6.1
>Reporter: Thomas Wozniakowski
>Priority: Minor
> Attachments: t1.log, t2.log, t3.log
>
>
> Hey Guys,
> Just testing the new bugfix release of 1.5.4 (edit: also happens with 1.6.1). 
> Happy to see that the issue of jobgraphs hanging around forever has been 
> resolved in standalone/zookeeper HA mode, but now I'm seeing a different 
> issue.
> It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of 
> zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new 
> version. I then proceeded to kill the leading jobmanager to test the failover.
> The remaining jobmanagers never triggered a leader election, and simply got 
> stuck.
> Please give me a shout if I can provide any more useful information
> EDIT
> Jobmanager logs attached below. You can see that I brought up a fresh 
> cluster, one JM was elected leader (no taskmanagers or actual jobs in this 
> case). I then let the cluster sit there for half an hour or so, before 
> killing the leader. The log files are snapshotted maybe half an hour after 
> that. You can see that a second election was never triggered.
> In case it's useful, our zookeeper quorum is running "3.5.3-beta". This setup 
> previously worked with 1.4.3. 
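
For anyone trying to narrow this down outside of Flink: the standalone HA JobManager election in 1.5.x/1.6.x is, as far as I understand it, built on Apache Curator's LeaderLatch recipe on top of the ZooKeeper quorum. A minimal stand-alone probe against the same 3.5.3-beta ensemble can show whether a second election fires at all when the current latch holder is killed. This is only a sketch; the quorum address and latch path below are placeholders, not the znodes Flink itself uses:

{code:java}
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.framework.recipes.leader.LeaderLatchListener;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class LeaderLatchProbe {

    public static void main(String[] args) throws Exception {
        // Placeholder quorum address; point this at the real 3.5.3-beta ensemble.
        String quorum = args.length > 0 ? args[0] : "zk1:2181,zk2:2181,zk3:2181";

        CuratorFramework client = CuratorFrameworkFactory.newClient(
                quorum, new ExponentialBackoffRetry(1000, 3));
        client.start();

        // Hypothetical latch path for the probe, not the path Flink uses internally.
        LeaderLatch latch = new LeaderLatch(client, "/flink-ha-probe");
        latch.addListener(new LeaderLatchListener() {
            @Override
            public void isLeader() {
                System.out.println("this contender became leader");
            }

            @Override
            public void notLeader() {
                System.out.println("this contender lost leadership");
            }
        });
        latch.start();

        // Keep the contender alive; kill one instance externally and watch whether
        // a surviving instance ever prints "this contender became leader".
        Thread.currentThread().join();
    }
}
{code}

Running two or three copies of this against the 3.5.3-beta quorum and killing whichever one reports leadership should help separate a Curator/ZooKeeper-level problem from something in Flink's own HA services.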





[jira] [Updated] (FLINK-10475) Standalone HA - Leader election is not triggered on loss of leader (ZK 3.5.3-beta only)

2018-10-02 Thomas Wozniakowski (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Wozniakowski updated FLINK-10475:

Summary: Standalone HA - Leader election is not triggered on loss of leader 
(ZK 3.5.3-beta only)  (was: Standalone HA - Leader election is not triggered on 
loss of leader)

> Standalone HA - Leader election is not triggered on loss of leader (ZK 
> 3.5.3-beta only)
> ---
>
> Key: FLINK-10475
> URL: https://issues.apache.org/jira/browse/FLINK-10475
> Project: Flink
>  Issue Type: Bug
>Affects Versions: 1.6.1, 1.5.4
>Reporter: Thomas Wozniakowski
>Priority: Blocker
> Attachments: t1.log, t2.log, t3.log
>
>
> Hey Guys,
> Just testing the new bugfix release of 1.5.4 (edit: also happens with 1.6.1). 
> Happy to see that the issue of jobgraphs hanging around forever has been 
> resolved in standalone/zookeeper HA mode, but now I'm seeing a different 
> issue.
> It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of 
> zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new 
> version. I then proceeded to kill the leading jobmanager to test the failover.
> The remaining jobmanagers never triggered a leader election, and simply got 
> stuck.
> Please give me a shout if I can provide any more useful information
> EDIT
> Jobmanager logs attached below. You can see that I brought up a fresh 
> cluster, one JM was elected leader (no taskmanagers or actual jobs in this 
> case). I then let the cluster sit there for half an hour or so, before 
> killing the leader. The log files are snapshotted maybe half an hour after 
> that. You can see that a second election was never triggered.
> In case it's useful, our zookeeper quorum is running "3.5.3-beta". This setup 
> previously worked with 1.4.3. 





[jira] [Updated] (FLINK-10475) Standalone HA - Leader election is not triggered on loss of leader (ZK 3.5.3-beta only)

2018-10-02 Thomas Wozniakowski (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Wozniakowski updated FLINK-10475:

Priority: Minor  (was: Blocker)

> Standalone HA - Leader election is not triggered on loss of leader (ZK 
> 3.5.3-beta only)
> ---
>
> Key: FLINK-10475
> URL: https://issues.apache.org/jira/browse/FLINK-10475
> Project: Flink
>  Issue Type: Bug
>Affects Versions: 1.6.1, 1.5.4
>Reporter: Thomas Wozniakowski
>Priority: Minor
> Attachments: t1.log, t2.log, t3.log
>
>
> Hey Guys,
> Just testing the new bugfix release of 1.5.4 (edit: also happens with 1.6.1). 
> Happy to see that the issue of jobgraphs hanging around forever has been 
> resolved in standalone/zookeeper HA mode, but now I'm seeing a different 
> issue.
> It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of 
> zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new 
> version. I then proceeded to kill the leading jobmanager to test the failover.
> The remaining jobmanagers never triggered a leader election, and simply got 
> stuck.
> Please give me a shout if I can provide any more useful information
> EDIT
> Jobmanager logs attached below. You can see that I brought up a fresh 
> cluster, one JM was elected leader (no taskmanagers or actual jobs in this 
> case). I then let the cluster sit there for half an hour or so, before 
> killing the leader. The log files are snapshotted maybe half an hour after 
> that. You can see that a second election was never triggered.
> In case it's useful, our zookeeper quorum is running "3.5.3-beta". This setup 
> previously worked with 1.4.3. 





[jira] [Updated] (FLINK-10475) Standalone HA - Leader election is not triggered on loss of leader

2018-10-02 Thomas Wozniakowski (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Wozniakowski updated FLINK-10475:

Description: 
Hey Guys,

Just testing the new bugfix release of 1.5.4 (edit: also happens with 1.6.1). 
Happy to see that the issue of jobgraphs hanging around forever has been 
resolved in standalone/zookeeper HA mode, but now I'm seeing a different issue.

It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of 
zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new 
version. I then proceeded to kill the leading jobmanager to test the failover.

The remaining jobmanagers never triggered a leader election, and simply got 
stuck.

Please give me a shout if I can provide any more useful information

EDIT

Jobmanager logs attached below. You can see that I brought up a fresh cluster, 
one JM was elected leader (no taskmanagers or actual jobs in this case). I then 
let the cluster sit there for half an hour or so, before killing the leader. 
The log files are snapshotted maybe half an hour after that. You can see that a 
second election was never triggered.

In case it's useful, our zookeeper quorum is running "3.5.3-beta". This setup 
previously worked with 1.4.3. 

  was:
Hey Guys,

Just testing the new bugfix release of 1.5.4 (edit: also happens with 1.6.1). 
Happy to see that the issue of jobgraphs hanging around forever has been 
resolved in standalone/zookeeper HA mode, but now I'm seeing a different issue.

It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of 
zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new 
version. I then proceeded to kill the leading jobmanager to test the failover.

The remaining jobmanagers never triggered a leader election, and simply got 
stuck.

Please give me a shout if I can provide any more useful information

EDIT

Jobmanager logs attached below. You can see that I brought up a fresh cluster, 
one JM was elected leader (no taskmanagers or actual jobs in this case). I then 
let the cluster sit there for half an hour or so, before killing the leader. 
The log files are snapshotted maybe half an hour after that. You can see that a 
second election was never triggered.

In case it's useful, our zookeeper quorum is running "3.5.4-beta". This setup 
previously worked with 1.4.3. 


> Standalone HA - Leader election is not triggered on loss of leader
> --
>
> Key: FLINK-10475
> URL: https://issues.apache.org/jira/browse/FLINK-10475
> Project: Flink
>  Issue Type: Bug
>Affects Versions: 1.6.1, 1.5.4
>Reporter: Thomas Wozniakowski
>Priority: Blocker
> Attachments: t1.log, t2.log, t3.log
>
>
> Hey Guys,
> Just testing the new bugfix release of 1.5.4 (edit: also happens with 1.6.1). 
> Happy to see that the issue of jobgraphs hanging around forever has been 
> resolved in standalone/zookeeper HA mode, but now I'm seeing a different 
> issue.
> It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of 
> zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new 
> version. I then proceeded to kill the leading jobmanager to test the failover.
> The remaining jobmanagers never triggered a leader election, and simply got 
> stuck.
> Please give me a shout if I can provide any more useful information
> EDIT
> Jobmanager logs attached below. You can see that I brought up a fresh 
> cluster, one JM was elected leader (no taskmanagers or actual jobs in this 
> case). I then let the cluster sit there for half an hour or so, before 
> killing the leader. The log files are snapshotted maybe half an hour after 
> that. You can see that a second election was never triggered.
> In case it's useful, our zookeeper quorum is running "3.5.3-beta". This setup 
> previously worked with 1.4.3. 





[jira] [Updated] (FLINK-10475) Standalone HA - Leader election is not triggered on loss of leader

2018-10-02 Thomas Wozniakowski (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Wozniakowski updated FLINK-10475:

Affects Version/s: 1.6.1

> Standalone HA - Leader election is not triggered on loss of leader
> --
>
> Key: FLINK-10475
> URL: https://issues.apache.org/jira/browse/FLINK-10475
> Project: Flink
>  Issue Type: Bug
>Affects Versions: 1.6.1, 1.5.4
>Reporter: Thomas Wozniakowski
>Priority: Blocker
> Attachments: t1.log, t2.log, t3.log
>
>
> Hey Guys,
> Just testing the new bugfix release of 1.5.4. Happy to see that the issue of 
> jobgraphs hanging around forever has been resolved in standalone/zookeeper HA 
> mode, but now I'm seeing a different issue.
> It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of 
> zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new 
> version. I then proceeded to kill the leading jobmanager to test the failover.
> The remaining jobmanagers never triggered a leader election, and simply got 
> stuck.
> Please give me a shout if I can provide any more useful information
> EDIT
> Jobmanager logs attached below. You can see that I brought up a fresh 
> cluster, one JM was elected leader (no taskmanagers or actual jobs in this 
> case). I then let the cluster sit there for half an hour or so, before 
> killing the leader. The log files are snapshotted maybe half an hour after 
> that. You can see that a second election was never triggered.
> In case it's useful, our zookeeper quorum is running "3.5.4-beta". This setup 
> previously worked with 1.4.3. 





[jira] [Updated] (FLINK-10475) Standalone HA - Leader election is not triggered on loss of leader

2018-10-02 Thomas Wozniakowski (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Wozniakowski updated FLINK-10475:

Description: 
Hey Guys,

Just testing the new bugfix release of 1.5.4 (edit: also happens with 1.6.1). 
Happy to see that the issue of jobgraphs hanging around forever has been 
resolved in standalone/zookeeper HA mode, but now I'm seeing a different issue.

It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of 
zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new 
version. I then proceeded to kill the leading jobmanager to test the failover.

The remaining jobmanagers never triggered a leader election, and simply got 
stuck.

Please give me a shout if I can provide any more useful information

EDIT

Jobmanager logs attached below. You can see that I brought up a fresh cluster, 
one JM was elected leader (no taskmanagers or actual jobs in this case). I then 
let the cluster sit there for half an hour or so, before killing the leader. 
The log files are snapshotted maybe half an hour after that. You can see that a 
second election was never triggered.

In case it's useful, our zookeeper quorum is running "3.5.4-beta". This setup 
previously worked with 1.4.3. 

  was:
Hey Guys,

Just testing the new bugfix release of 1.5.4. Happy to see that the issue of 
jobgraphs hanging around forever has been resolved in standalone/zookeeper HA 
mode, but now I'm seeing a different issue.

It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of 
zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new 
version. I then proceeded to kill the leading jobmanager to test the failover.

The remaining jobmanagers never triggered a leader election, and simply got 
stuck.

Please give me a shout if I can provide any more useful information

EDIT

Jobmanager logs attached below. You can see that I brought up a fresh cluster, 
one JM was elected leader (no taskmanagers or actual jobs in this case). I then 
let the cluster sit there for half an hour or so, before killing the leader. 
The log files are snapshotted maybe half an hour after that. You can see that a 
second election was never triggered.

In case it's useful, our zookeeper quorum is running "3.5.4-beta". This setup 
previously worked with 1.4.3. 


> Standalone HA - Leader election is not triggered on loss of leader
> --
>
> Key: FLINK-10475
> URL: https://issues.apache.org/jira/browse/FLINK-10475
> Project: Flink
>  Issue Type: Bug
>Affects Versions: 1.6.1, 1.5.4
>Reporter: Thomas Wozniakowski
>Priority: Blocker
> Attachments: t1.log, t2.log, t3.log
>
>
> Hey Guys,
> Just testing the new bugfix release of 1.5.4 (edit: also happens with 1.6.1). 
> Happy to see that the issue of jobgraphs hanging around forever has been 
> resolved in standalone/zookeeper HA mode, but now I'm seeing a different 
> issue.
> It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of 
> zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new 
> version. I then proceeded to kill the leading jobmanager to test the failover.
> The remaining jobmanagers never triggered a leader election, and simply got 
> stuck.
> Please give me a shout if I can provide any more useful information
> EDIT
> Jobmanager logs attached below. You can see that I brought up a fresh 
> cluster, one JM was elected leader (no taskmanagers or actual jobs in this 
> case). I then let the cluster sit there for half an hour or so, before 
> killing the leader. The log files are snapshotted maybe half an hour after 
> that. You can see that a second election was never triggered.
> In case it's useful, our zookeeper quorum is running "3.5.4-beta". This setup 
> previously worked with 1.4.3. 





[jira] [Updated] (FLINK-10475) Standalone HA - Leader election is not triggered on loss of leader

2018-10-02 Thomas Wozniakowski (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Wozniakowski updated FLINK-10475:

Description: 
Hey Guys,

Just testing the new bugfix release of 1.5.4. Happy to see that the issue of 
jobgraphs hanging around forever has been resolved in standalone/zookeeper HA 
mode, but now I'm seeing a different issue.

It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of 
zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new 
version. I then proceeded to kill the leading jobmanager to test the failover.

The remaining jobmanagers never triggered a leader election, and simply got 
stuck.

Please give me a shout if I can provide any more useful information

EDIT

Jobmanager logs attached below. You can see that I brought up a fresh cluster, 
one JM was elected leader (no taskmanagers or actual jobs in this case). I then 
let the cluster sit there for half an hour or so, before killing the leader. 
The log files are snapshotted maybe half an hour after that. You can see that a 
second election was never triggered.

In case it's useful, our zookeeper quorum is running "3.5.4-beta". This setup 
previously worked with 1.4.3. 

  was:
Hey Guys,

Just testing the new bugfix release of 1.5.4. Happy to see that the issue of 
jobgraphs hanging around forever has been resolved in standalone/zookeeper HA 
mode, but now I'm seeing a different issue.

It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of 
zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new 
version. I then proceeded to kill the leading jobmanager to test the failover.

The remaining jobmanagers never triggered a leader election, and simply got 
stuck.
The logs of the remaining job managers were full of this:

{quote}
2018-10-01 15:35:44,558 ERROR 
org.apache.flink.runtime.rest.handler.job.JobsOverviewHandler  - Could not 
retrieve the redirect address.
java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask 
timed out on 
[Actor[akka.tcp://flink@10.1.3.118:50010/user/dispatcher#-1286445443]] after 
[1 ms]. Sender[null] sent message of type 
"org.apache.flink.runtime.rpc.messages.RemoteFencedMessage".
at 
java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
at 
java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
at 
java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
at 
java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at 
java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
at 
org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:770)
at akka.dispatch.OnComplete.internal(Future.scala:258)
at akka.dispatch.OnComplete.internal(Future.scala:256)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
at 
org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83)
at 
scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
at 
scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
at 
akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)
at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
at 
scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
at 
scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
at 
scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
at 
akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
at 
akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
at 
akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
at 
akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
at java.lang.Thread.run(Thread.java:745)
{quote}

Please give me a shout if I can provide any more useful information

Jobmanager logs attached below. You can see that I brought up a fresh cluster, 
one JM was elected leader (no taskmanagers or actual jobs in this case). I then 
let the cluster sit there for half an hour or so, before killing the leader. 
The log files are snapshotted maybe half an hour after that. You can see that a 
second election was never triggered.

In case it's useful, our zookeeper quorum 

[jira] [Updated] (FLINK-10475) Standalone HA - Leader election is not triggered on loss of leader

2018-10-02 Thomas Wozniakowski (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Wozniakowski updated FLINK-10475:

Description: 
Hey Guys,

Just testing the new bugfix release of 1.5.4. Happy to see that the issue of 
jobgraphs hanging around forever has been resolved in standalone/zookeeper HA 
mode, but now I'm seeing a different issue.

It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of 
zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new 
version. I then proceeded to kill the leading jobmanager to test the failover.

The remaining jobmanagers never triggered a leader election, and simply got 
stuck.
The logs of the remaining job managers were full of this:

{quote}
2018-10-01 15:35:44,558 ERROR 
org.apache.flink.runtime.rest.handler.job.JobsOverviewHandler  - Could not 
retrieve the redirect address.
java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask 
timed out on 
[Actor[akka.tcp://flink@10.1.3.118:50010/user/dispatcher#-1286445443]] after 
[1 ms]. Sender[null] sent message of type 
"org.apache.flink.runtime.rpc.messages.RemoteFencedMessage".
at 
java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
at 
java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
at 
java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
at 
java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at 
java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
at 
org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:770)
at akka.dispatch.OnComplete.internal(Future.scala:258)
at akka.dispatch.OnComplete.internal(Future.scala:256)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
at 
org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83)
at 
scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
at 
scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
at 
akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)
at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
at 
scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
at 
scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
at 
scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
at 
akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
at 
akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
at 
akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
at 
akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
at java.lang.Thread.run(Thread.java:745)
{quote}

Please give me a shout if I can provide any more useful information

Jobmanager logs attached below. You can see that I brought up a fresh cluster, 
one JM was elected leader (no taskmanagers or actual jobs in this case). I then 
let the cluster sit there for half an hour or so, before killing the leader. 
The log files are snapshotted maybe half an hour after that. You can see that a 
second election was never triggered.

In case it's useful, our zookeeper quorum is running "3.5.4-beta". This setup 
previously worked with 1.4.3. 

  was:
Hey Guys,

Just testing the new bugfix release of 1.5.4. Happy to see that the issue of 
jobgraphs hanging around forever has been resolved in standalone/zookeeper HA 
mode, but now I'm seeing a different issue.

It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of 
zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new 
version. I then proceeded to kill the leading jobmanager to test the failover.

The remaining jobmanagers never triggered a leader election, and simply got 
stuck.
The logs of the remaining job managers were full of this:

{quote}
2018-10-01 15:35:44,558 ERROR 
org.apache.flink.runtime.rest.handler.job.JobsOverviewHandler  - Could not 
retrieve the redirect address.
java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask 
timed out on 
[Actor[akka.tcp://flink@10.1.3.118:50010/user/dispatcher#-1286445443]] after 
[1 ms]. Sender[null] sent message of type 
"org.apache.flink.runtime.rpc.messages.RemoteFenced

[jira] [Updated] (FLINK-10475) Standalone HA - Leader election is not triggered on loss of leader

2018-10-02 Thomas Wozniakowski (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Wozniakowski updated FLINK-10475:

Attachment: t1.log
t2.log
t3.log

> Standalone HA - Leader election is not triggered on loss of leader
> --
>
> Key: FLINK-10475
> URL: https://issues.apache.org/jira/browse/FLINK-10475
> Project: Flink
>  Issue Type: Bug
>Affects Versions: 1.5.4
>Reporter: Thomas Wozniakowski
>Priority: Blocker
> Attachments: t1.log, t2.log, t3.log
>
>
> Hey Guys,
> Just testing the new bugfix release of 1.5.4. Happy to see that the issue of 
> jobgraphs hanging around forever has been resolved in standalone/zookeeper HA 
> mode, but now I'm seeing a different issue.
> It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of 
> zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new 
> version. I then proceeded to kill the leading jobmanager to test the failover.
> The remaining jobmanagers never triggered a leader election, and simply got 
> stuck.
> The logs of the remaining job managers were full of this:
> {quote}
> 2018-10-01 15:35:44,558 ERROR 
> org.apache.flink.runtime.rest.handler.job.JobsOverviewHandler  - Could not 
> retrieve the redirect address.
> java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: 
> Ask timed out on 
> [Actor[akka.tcp://flink@10.1.3.118:50010/user/dispatcher#-1286445443]] after 
> [1 ms]. Sender[null] sent message of type 
> "org.apache.flink.runtime.rpc.messages.RemoteFencedMessage".
>   at 
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
>   at 
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
>   at 
> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
>   at 
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
>   at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>   at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>   at 
> org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:770)
>   at akka.dispatch.OnComplete.internal(Future.scala:258)
>   at akka.dispatch.OnComplete.internal(Future.scala:256)
>   at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
>   at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
>   at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
>   at 
> org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83)
>   at 
> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
>   at 
> akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)
>   at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
>   at 
> scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
>   at 
> scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
>   at 
> scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
>   at 
> akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
>   at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
>   at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
>   at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
>   at java.lang.Thread.run(Thread.java:745)
> {quote}
> Please give me a shout if I can provide any more useful information
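
The AskTimeoutException above reads like a downstream symptom rather than the root cause: with no new leader ever elected, the REST handlers keep asking for a dispatcher/redirect address that can no longer be resolved. What ultimately has to happen at the ZooKeeper level is that the killed JobManager's session expires and its ephemeral latch znode is deleted, which is the event the waiting contenders react to. A small watcher sketch with the plain ZooKeeper client can confirm whether the 3.5.3-beta ensemble actually delivers that deletion notification; the quorum address and znode path are placeholders (any ephemeral node works, e.g. one created by hand or one of the latch children under Flink's HA root):

{code:java}
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class EphemeralNodeWatch {

    public static void main(String[] args) throws Exception {
        // Placeholders: args[0] = quorum address, args[1] = ephemeral znode to watch.
        String quorum = args[0];
        String path = args[1];

        CountDownLatch deleted = new CountDownLatch(1);

        // 30s session timeout; the default watcher only receives connection events here.
        ZooKeeper zk = new ZooKeeper(quorum, 30_000, event -> { });

        // One-shot watch on the node; NodeDeleted should fire once the owning
        // session expires after its JobManager process is killed.
        Watcher watcher = (WatchedEvent event) -> {
            if (event.getType() == Watcher.Event.EventType.NodeDeleted) {
                System.out.println("NodeDeleted delivered for " + event.getPath());
                deleted.countDown();
            }
        };

        Stat stat = zk.exists(path, watcher);
        if (stat == null) {
            System.out.println("znode does not exist: " + path);
            zk.close();
            return;
        }

        deleted.await();
        zk.close();
    }
}
{code}

If the NodeDeleted event never arrives from the 3.5.3-beta quorum but does from a 3.4.x quorum, that would point at the ZooKeeper/Curator side rather than at Flink's HA services.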





[jira] [Updated] (FLINK-10475) Standalone HA - Leader election is not triggered on loss of leader

2018-10-01 Thomas Wozniakowski (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Wozniakowski updated FLINK-10475:

Description: 
Hey Guys,

Just testing the new bugfix release of 1.5.4. Happy to see that the issue of 
jobgraphs hanging around forever has been resolved in standalone/zookeeper HA 
mode, but now I'm seeing a different issue.

It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of 
zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new 
version. I then proceeded to kill the leading jobmanager to test the failover.

The remaining jobmanagers never triggered a leader election, and simply got 
stuck.
The logs of the remaining job managers were full of this:

{quote}
2018-10-01 15:35:44,558 ERROR 
org.apache.flink.runtime.rest.handler.job.JobsOverviewHandler  - Could not 
retrieve the redirect address.
java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask 
timed out on 
[Actor[akka.tcp://flink@10.1.3.118:50010/user/dispatcher#-1286445443]] after 
[1 ms]. Sender[null] sent message of type 
"org.apache.flink.runtime.rpc.messages.RemoteFencedMessage".
at 
java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
at 
java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
at 
java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
at 
java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at 
java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
at 
org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:770)
at akka.dispatch.OnComplete.internal(Future.scala:258)
at akka.dispatch.OnComplete.internal(Future.scala:256)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
at 
org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83)
at 
scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
at 
scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
at 
akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)
at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
at 
scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
at 
scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
at 
scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
at 
akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
at 
akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
at 
akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
at 
akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
at java.lang.Thread.run(Thread.java:745)
{quote}

  was:
Hey Guys,

Just testing the new bugfix release of 1.5.4. Happy to see that the issue of 
jobgraphs hanging around forever has been resolved in standalone/zookeeper HA 
mode, but now I'm seeing a different issue.

It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of 
zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new 
version. I then proceeded to kill the leading jobmanager to test the failover.

The remaining jobmanagers never triggered a leader election, and simply got 
stuck.
The logs of the remaining job managers were full of this:

```
2018-10-01 15:35:44,558 ERROR 
org.apache.flink.runtime.rest.handler.job.JobsOverviewHandler  - Could not 
retrieve the redirect address.
java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask 
timed out on 
[Actor[akka.tcp://flink@10.1.3.118:50010/user/dispatcher#-1286445443]] after 
[1 ms]. Sender[null] sent message of type 
"org.apache.flink.runtime.rpc.messages.RemoteFencedMessage".
at 
java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
at 
java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
at 
java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
at 
java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at 
java.util.concurrent.CompletableFuture.completeExcep

[jira] [Updated] (FLINK-10475) Standalone HA - Leader election is not triggered on loss of leader

2018-10-01 Thomas Wozniakowski (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Wozniakowski updated FLINK-10475:

Description: 
Hey Guys,

Just testing the new bugfix release of 1.5.4. Happy to see that the issue of 
jobgraphs hanging around forever has been resolved in standalone/zookeeper HA 
mode, but now I'm seeing a different issue.

It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of 
zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new 
version. I then proceeded to kill the leading jobmanager to test the failover.

The remaining jobmanagers never triggered a leader election, and simply got 
stuck.
The logs of the remaining job managers were full of this:

{quote}
2018-10-01 15:35:44,558 ERROR 
org.apache.flink.runtime.rest.handler.job.JobsOverviewHandler  - Could not 
retrieve the redirect address.
java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask 
timed out on 
[Actor[akka.tcp://flink@10.1.3.118:50010/user/dispatcher#-1286445443]] after 
[1 ms]. Sender[null] sent message of type 
"org.apache.flink.runtime.rpc.messages.RemoteFencedMessage".
at 
java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
at 
java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
at 
java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
at 
java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at 
java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
at 
org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:770)
at akka.dispatch.OnComplete.internal(Future.scala:258)
at akka.dispatch.OnComplete.internal(Future.scala:256)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
at 
org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83)
at 
scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
at 
scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
at 
akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)
at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
at 
scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
at 
scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
at 
scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
at 
akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
at 
akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
at 
akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
at 
akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
at java.lang.Thread.run(Thread.java:745)
{quote}

Please give me a shout if I can provide any more useful information

  was:
Hey Guys,

Just testing the new bugfix release of 1.5.4. Happy to see that the issue of 
jobgraphs hanging around forever has been resolved in standalone/zookeeper HA 
mode, but now I'm seeing a different issue.

It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of 
zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new 
version. I then proceeded to kill the leading jobmanager to test the failover.

The remaining jobmanagers never triggered a leader election, and simply got 
stuck.
The logs of the remaining job managers were full of this:

{quote}
2018-10-01 15:35:44,558 ERROR 
org.apache.flink.runtime.rest.handler.job.JobsOverviewHandler  - Could not 
retrieve the redirect address.
java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask 
timed out on 
[Actor[akka.tcp://flink@10.1.3.118:50010/user/dispatcher#-1286445443]] after 
[1 ms]. Sender[null] sent message of type 
"org.apache.flink.runtime.rpc.messages.RemoteFencedMessage".
at 
java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
at 
java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
at 
java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
at 
java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.j