[ https://issues.apache.org/jira/browse/KAFKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15014006#comment-15014006 ]
Zach Cox commented on KAFKA-1451: --------------------------------- [~fpj] Yes we saw the "I wrote this conflicted ephemeral node" error messages, we saw lots of partitions in/out of ISRs and a lot of this too: {code} [2015-11-19 01:05:51,685] INFO Opening socket connection to server ip-10-10-1-35.ec2.internal/10.10.1.35:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn) [2015-11-19 01:05:51,685] INFO Socket connection established to ip-10-10-1-35.ec2.internal/10.10.1.35:2181, initiating session (org.apache.zookeeper.ClientCnxn) [2015-11-19 01:05:51,687] INFO Unable to reconnect to ZooKeeper service, session 0x54a0e5799a8195d has expired, closing socket connection (org.apache.zookeeper.ClientCnxn) [2015-11-19 01:05:51,687] INFO zookeeper state changed (Expired) (org.I0Itec.zkclient.ZkClient) [2015-11-19 01:05:51,687] INFO Initiating client connection, connectString=zookeeper1.production.redacted.com:2181,zookeeper2.production.redacted.com:2181,zookeeper3.production.redacted.com:2181/kafka sessionTimeout=6000 watcher=org.I0Itec.zkclient.ZkClient@ace1333 (org.apache.zookeeper.ZooKeeper) [2015-11-19 01:05:51,701] INFO EventThread shut down (org.apache.zookeeper.ClientCnxn) [2015-11-19 01:05:51,701] ERROR Error handling event ZkEvent[New session event sent to kafka.controller.KafkaController$SessionExpirationListener@2261adb8] (org.I0Itec.zkclient.ZkEventThread) java.lang.IllegalStateException: Kafka scheduler has not been started at kafka.utils.KafkaScheduler.ensureStarted(KafkaScheduler.scala:114) at kafka.utils.KafkaScheduler.shutdown(KafkaScheduler.scala:86) at kafka.controller.KafkaController.onControllerResignation(KafkaController.scala:350) at kafka.controller.KafkaController$SessionExpirationListener$$anonfun$handleNewSession$1.apply$mcZ$sp(KafkaController.scala:1108) at kafka.controller.KafkaController$SessionExpirationListener$$anonfun$handleNewSession$1.apply(KafkaController.scala:1107) at kafka.controller.KafkaController$SessionExpirationListener$$anonfun$handleNewSession$1.apply(KafkaController.scala:1107) at kafka.utils.Utils$.inLock(Utils.scala:535) at kafka.controller.KafkaController$SessionExpirationListener.handleNewSession(KafkaController.scala:1107) at org.I0Itec.zkclient.ZkClient$4.run(ZkClient.java:472) at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71) [2015-11-19 01:05:51,701] INFO re-registering broker info in ZK for broker 3 (kafka.server.KafkaHealthcheck) [2015-11-19 01:05:51,701] INFO Opening socket connection to server ip-10-10-1-104.ec2.internal/10.10.1.104:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn) [2015-11-19 01:05:51,702] INFO Socket connection established to ip-10-10-1-104.ec2.internal/10.10.1.104:2181, initiating session (org.apache.zookeeper.ClientCnxn) [2015-11-19 01:05:51,713] INFO Session establishment complete on server ip-10-10-1-104.ec2.internal/10.10.1.104:2181, sessionid = 0x64a0e57972a1a85, negotiated timeout = 6000 (org.apache.zookeeper.ClientCnxn) [2015-11-19 01:05:51,713] INFO zookeeper state changed (SyncConnected) (org.I0Itec.zkclient.ZkClient) [2015-11-19 01:05:51,718] INFO Registered broker 3 at path /brokers/ids/3 with address mesos-slave3.production.redacted.com:9092. (kafka.utils.ZkUtils$) [2015-11-19 01:05:51,718] INFO done re-registering broker (kafka.server.KafkaHealthcheck) [2015-11-19 01:05:51,718] INFO Subscribing to /brokers/topics path to watch for new topics (kafka.server.KafkaHealthcheck) [2015-11-19 01:05:51,721] INFO New leader is 1 (kafka.server.ZookeeperLeaderElector$LeaderChangeListener) {code} > Broker stuck due to leader election race > ----------------------------------------- > > Key: KAFKA-1451 > URL: https://issues.apache.org/jira/browse/KAFKA-1451 > Project: Kafka > Issue Type: Bug > Components: core > Affects Versions: 0.8.1.1 > Reporter: Maciek Makowski > Assignee: Manikumar Reddy > Priority: Minor > Labels: newbie > Fix For: 0.8.2.0 > > Attachments: KAFKA-1451.patch, KAFKA-1451_2014-07-28_20:27:32.patch, > KAFKA-1451_2014-07-29_10:13:23.patch > > > h3. Symptoms > The broker does not become available due to being stuck in an infinite loop > while electing leader. This can be recognised by the following line being > repeatedly written to server.log: > {code} > [2014-05-14 04:35:09,187] INFO I wrote this conflicted ephemeral node > [{"version":1,"brokerid":1,"timestamp":"1400060079108"}] at /controller a > while back in a different session, hence I will backoff for this node to be > deleted by Zookeeper and retry (kafka.utils.ZkUtils$) > {code} > h3. Steps to Reproduce > In a single kafka 0.8.1.1 node, single zookeeper 3.4.6 (but will likely > behave the same with the ZK version included in Kafka distribution) node > setup: > # start both zookeeper and kafka (in any order) > # stop zookeeper > # stop kafka > # start kafka > # start zookeeper > h3. Likely Cause > {{ZookeeperLeaderElector}} subscribes to data changes on startup, and then > triggers an election. if the deletion of ephemeral {{/controller}} node > associated with previous zookeeper session of the broker happens after > subscription to changes in new session, election will be invoked twice, once > from {{startup}} and once from {{handleDataDeleted}}: > * {{startup}}: acquire {{controllerLock}} > * {{startup}}: subscribe to data changes > * zookeeper: delete {{/controller}} since the session that created it timed > out > * {{handleDataDeleted}}: {{/controller}} was deleted > * {{handleDataDeleted}}: wait on {{controllerLock}} > * {{startup}}: elect -- writes {{/controller}} > * {{startup}}: release {{controllerLock}} > * {{handleDataDeleted}}: acquire {{controllerLock}} > * {{handleDataDeleted}}: elect -- attempts to write {{/controller}} and then > gets into infinite loop as a result of conflict > {{createEphemeralPathExpectConflictHandleZKBug}} assumes that the existing > znode was written from different session, which is not true in this case; it > was written from the same session. That adds to the confusion. > h3. Suggested Fix > In {{ZookeeperLeaderElector.startup}} first run {{elect}} and then subscribe > to data changes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)