[ https://issues.apache.org/jira/browse/FLINK-9371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16483003#comment-16483003 ]
Mike Urbach commented on FLINK-9371:
------------------------------------

I have encountered this too, specifically the error stating "because there is currently no valid leader id known". I have not encountered "because the expected leader session ID ... did not equal the received leader session ID".

Does anyone have some insight on this issue? I have not dug into it at all. We run our cluster on Kubernetes, and I have worked around this by deleting all of our resources, cleaning up the state in ZooKeeper, and re-creating everything from scratch. If it is helpful, I can provide more info about our setup.

> High Availability JobManager Registration Failure
> -------------------------------------------------
>
>                 Key: FLINK-9371
>                 URL: https://issues.apache.org/jira/browse/FLINK-9371
>             Project: Flink
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.4.2
>            Reporter: Jason Kania
>            Priority: Major
>
> The following error is happening intermittently on a 3-node cluster with 2
> JobManagers configured in HA mode. When this happens, the two JobManager
> instances are associated with one another.
>
> 2018-05-15 19:00:06,400 INFO org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager - Trying to associate with JobManager leader akka.tcp://flink@aaa-1:50000/user/jobmanager
> 2018-05-15 19:00:06,404 WARN org.apache.flink.runtime.jobmanager.JobManager - Discard message LeaderSessionMessage(0bbe70c4-2642-4a08-912f-6cc09646281f,RegisterResourceManager akka://flink/user/resourcemanager-d6567c5d-85f4-4b18-8eac-cf9725d076a5) because there is currently no valid leader id known.
> 2018-05-15 19:00:16,418 ERROR org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager - Resource manager could not register at JobManager
> akka.pattern.AskTimeoutException: Ask timed out on [ActorSelection[Anchor(akka://flink/), Path(/user/jobmanager)]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.messages.JobManagerMessages$LeaderSessionMessage".
>     at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)
>     at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
>     at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
>     at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
>     at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
>     at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
>     at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
>     at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
>     at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
>     at java.lang.Thread.run(Thread.java:748)
>
> Sometimes the following kind of log message also appears after the previous one:
>
> 2018-05-15 19:13:47,525 WARN org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager - Discard message LeaderSessionMessage(5cab29b9-10d3-4b25-b934-f06b82be15b5,TriggerRegistrationAtJobManager akka.tcp://flink@aaa-1:50000/user/jobmanager) because the expected leader session ID 61075587-51da-4e58-ac4f-9ea118ccdde9 did not equal the received leader session ID 5cab29b9-10d3-4b25-b934-f06b82be15b5.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)