Ricky Burnett created FLINK-22516:
-------------------------------------
Summary: ResourceManager cannot establish leadership
Key: FLINK-22516
URL: https://issues.apache.org/jira/browse/FLINK-22516
Project: Flink
Issue Type: Bug
Reporter: Ricky Burnett
Attachments: jobmanager_leadership.log
We are running Flink clusters with 2 JobManagers in HA mode. After a ZooKeeper
restart the two JMs begin leader election and end up in a state where they are
both trying to start their ResourceManager. This continues until one of them
writes to `leader/<jobid>/resource_manager_lock` and the JobMaster proceeds to
execute `notifyOfNewResourceManagerLeader`, which restarts the ResourceManager.
This in turn writes to `leader/<jobid>/resource_manager_lock`, which triggers
the other JobMaster to restart its ResourceManager. We can see this in the logs
from the "ResourceManager leader changed to new address" message, which goes
back and forth between the two JMs and the two IP addresses. This cycle appears
to continue indefinitely without outside interruption.
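To make the back-and-forth easier to quantify in a combined log, something
like the following could be used. This is only a rough sketch, not part of
Flink: the class name `LeaderFlapCounter` is hypothetical, and it assumes each
log entry is a single line of JSON with a `hostname` field, as in the attached
logs. It counts how often consecutive "ResourceManager leader changed to new
address" messages come from different hosts.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LeaderFlapCounter {
    // Message text quoted in this report.
    private static final String MARKER =
            "ResourceManager leader changed to new address";

    // Assumption: every JSON log line carries a "hostname" field.
    private static final Pattern HOSTNAME =
            Pattern.compile("\"hostname\"\\s*:\\s*\"([^\"]+)\"");

    /** Counts consecutive leader-change messages that come from different hosts. */
    public static int countFlaps(List<String> logLines) {
        int flaps = 0;
        String previousHost = null;
        for (String line : logLines) {
            if (!line.contains(MARKER)) {
                continue; // not a leader-change message
            }
            Matcher m = HOSTNAME.matcher(line);
            if (!m.find()) {
                continue; // no hostname to attribute the message to
            }
            String host = m.group(1);
            if (previousHost != null && !previousHost.equals(host)) {
                flaps++;
            }
            previousHost = host;
        }
        return flaps;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: LeaderFlapCounter <combined-log-file>");
            return;
        }
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        System.out.println("leader flaps: " + countFlaps(lines));
    }
}
```

A count that keeps growing as the log gets longer would match the cycle
described above; a healthy failover should produce only a handful of changes.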
I've attached combined logs from two JMs in our environment that got into this
state. The logs start with the loss of connection and end with a couple of
cycles of back and forth. The two relevant hosts are
"flink-jm-828d4aa2-d4d4-457b-995d-feb56d08c1fb-784cdb9c57-tsxb7" and
"flink-jm-828d4aa2-d4d4-457b-995d-feb56d08c1fb-784cdb9c57-mpf9x".
*-tsxb7 appears to be the last host that was granted leadership.
{code:java}
{"thread":"Curator-Framework-0-EventThread","level":"INFO","loggerName":"org.apache.flink.runtime.jobmaster.JobManagerRunner","message":"JobManager runner for job tenant: ssademo, pipeline: 828d4aa2-d4d4-457b-995d-feb56d08c1fb, name: integration-test-detection (33e12948df69077ab3b33316eacbb5e4) was granted leadership with session id 97992805-9c60-40ba-8260-aaf036694cde at akka.tcp://[email protected]:6123/user/jobmanager_3.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","instant":{"epochSecond":1617129712,"nanoOfSecond":447000000},"contextMap":{},"threadId":152,"threadPriority":5,"source":{"class":"org.apache.flink.runtime.jobmaster.JobManagerRunner","method":"startJobMaster","file":"JobManagerRunner.java","line":313},"service":"streams","time":"2021-03-30T18:41:52.447UTC","hostname":"flink-jm-828d4aa2-d4d4-457b-995d-feb56d08c1fb-784cdb9c57-tsxb7"}
{code}
But *-mpf9x continues to try to wrest control back.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)