[jira] [Updated] (FLINK-10475) Standalone HA - Leader election is not triggered on loss of leader

Thomas Wozniakowski (JIRA) Mon, 01 Oct 2018 08:38:39 -0700


     [ https://issues.apache.org/jira/browse/FLINK-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Thomas Wozniakowski updated FLINK-10475:
----------------------------------------
    Description: 
Hey Guys,

Just testing the new bugfix release of 1.5.4. Happy to see that the issue of 
jobgraphs hanging around forever has been resolved in standalone/zookeeper HA 
mode, but now I'm seeing a different issue.

It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of 
zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new 
version. I then proceeded to kill the leading jobmanager to test the failover.

The remaining jobmanagers never triggered a leader election, and simply got 
stuck.
The logs of the remaining job managers were full of this:

{quote}
2018-10-01 15:35:44,558 ERROR org.apache.flink.runtime.rest.handler.job.JobsOverviewHandler  - Could not retrieve the redirect address.
java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@10.1.3.118:50010/user/dispatcher#-1286445443]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.RemoteFencedMessage".
        at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
        at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
        at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
        at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
        at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
        at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:770)
        at akka.dispatch.OnComplete.internal(Future.scala:258)
        at akka.dispatch.OnComplete.internal(Future.scala:256)
        at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
        at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
        at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
        at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83)
        at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
        at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
        at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)
        at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
        at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
        at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
        at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
        at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
        at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
        at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
        at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
        at java.lang.Thread.run(Thread.java:745)
{quote}

  was:
Hey Guys,

Just testing the new bugfix release of 1.5.4. Happy to see that the issue of 
jobgraphs hanging around forever has been resolved in standalone/zookeeper HA 
mode, but now I'm seeing a different issue.

It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of 
zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new 
version. I then proceeded to kill the leading jobmanager to test the failover.

The remaining jobmanagers never triggered a leader election, and simply got 
stuck.
The logs of the remaining job managers were full of this:

```
2018-10-01 15:35:44,558 ERROR org.apache.flink.runtime.rest.handler.job.JobsOverviewHandler  - Could not retrieve the redirect address.
java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@10.1.3.118:50010/user/dispatcher#-1286445443]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.RemoteFencedMessage".
        at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
        at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
        at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
        at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
        at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
        at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:770)
        at akka.dispatch.OnComplete.internal(Future.scala:258)
        at akka.dispatch.OnComplete.internal(Future.scala:256)
        at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
        at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
        at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
        at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83)
        at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
        at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
        at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)
        at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
        at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
        at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
        at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
        at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
        at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
        at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
        at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
        at java.lang.Thread.run(Thread.java:745)
```


> Standalone HA - Leader election is not triggered on loss of leader
> ------------------------------------------------------------------
>
>                 Key: FLINK-10475
>                 URL: https://issues.apache.org/jira/browse/FLINK-10475
>             Project: Flink
>          Issue Type: Bug
>    Affects Versions: 1.5.4
>            Reporter: Thomas Wozniakowski
>            Priority: Blocker
>
> Hey Guys,
> Just testing the new bugfix release of 1.5.4. Happy to see that the issue of 
> jobgraphs hanging around forever has been resolved in standalone/zookeeper HA 
> mode, but now I'm seeing a different issue.
> It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of 
> zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new 
> version. I then proceeded to kill the leading jobmanager to test the failover.
> The remaining jobmanagers never triggered a leader election, and simply got 
> stuck.
> The logs of the remaining job managers were full of this:
> {quote}
> 2018-10-01 15:35:44,558 ERROR org.apache.flink.runtime.rest.handler.job.JobsOverviewHandler  - Could not retrieve the redirect address.
> java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@10.1.3.118:50010/user/dispatcher#-1286445443]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.RemoteFencedMessage".
>       at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
>       at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
>       at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
>       at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
>       at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>       at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>       at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:770)
>       at akka.dispatch.OnComplete.internal(Future.scala:258)
>       at akka.dispatch.OnComplete.internal(Future.scala:256)
>       at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
>       at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
>       at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
>       at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83)
>       at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
>       at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
>       at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)
>       at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
>       at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
>       at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
>       at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
>       at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
>       at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
>       at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
>       at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
>       at java.lang.Thread.run(Thread.java:745)
> {quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
