[ https://issues.apache.org/jira/browse/KAFKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14791763#comment-14791763 ]
dude commented on KAFKA-1451:
-----------------------------

Also occurred in a 3-node Kafka 0.8.2.1 cluster.

> Broker stuck due to leader election race
> -----------------------------------------
>
>                 Key: KAFKA-1451
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1451
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.8.1.1
>            Reporter: Maciek Makowski
>            Assignee: Manikumar Reddy
>            Priority: Minor
>              Labels: newbie
>             Fix For: 0.8.2.0
>
>         Attachments: KAFKA-1451.patch, KAFKA-1451_2014-07-28_20:27:32.patch,
> KAFKA-1451_2014-07-29_10:13:23.patch
>
>
> h3. Symptoms
> The broker does not become available due to being stuck in an infinite loop
> while electing a leader. This can be recognised by the following line being
> repeatedly written to server.log:
> {code}
> [2014-05-14 04:35:09,187] INFO I wrote this conflicted ephemeral node
> [{"version":1,"brokerid":1,"timestamp":"1400060079108"}] at /controller a
> while back in a different session, hence I will backoff for this node to be
> deleted by Zookeeper and retry (kafka.utils.ZkUtils$)
> {code}
> h3. Steps to Reproduce
> In a single kafka 0.8.1.1 node, single zookeeper 3.4.6 (but will likely
> behave the same with the ZK version included in Kafka distribution) node
> setup:
> # start both zookeeper and kafka (in any order)
> # stop zookeeper
> # stop kafka
> # start kafka
> # start zookeeper
> h3. Likely Cause
> {{ZookeeperLeaderElector}} subscribes to data changes on startup, and then
> triggers an election. If the deletion of the ephemeral {{/controller}} node
> associated with the previous zookeeper session of the broker happens after
> subscription to changes in the new session, election will be invoked twice,
> once from {{startup}} and once from {{handleDataDeleted}}:
> * {{startup}}: acquire {{controllerLock}}
> * {{startup}}: subscribe to data changes
> * zookeeper: delete {{/controller}} since the session that created it timed
> out
> * {{handleDataDeleted}}: {{/controller}} was deleted
> * {{handleDataDeleted}}: wait on {{controllerLock}}
> * {{startup}}: elect -- writes {{/controller}}
> * {{startup}}: release {{controllerLock}}
> * {{handleDataDeleted}}: acquire {{controllerLock}}
> * {{handleDataDeleted}}: elect -- attempts to write {{/controller}} and then
> gets into an infinite loop as a result of the conflict
> {{createEphemeralPathExpectConflictHandleZKBug}} assumes that the existing
> znode was written from a different session, which is not true in this case;
> it was written from the same session. That adds to the confusion.
> h3. Suggested Fix
> In {{ZookeeperLeaderElector.startup}} first run {{elect}} and then subscribe
> to data changes.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
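
For anyone trying to reason about the interleaving listed under "Likely Cause", it can be replayed with a small deterministic model. This is only a sketch in Python, not Kafka code: `FakeZk`, `run`, the outcome strings and the session names are all hypothetical, the lock is modelled by queueing the watcher callback until startup finishes, and the backoff loop is reduced to a single retry after the stale node is gone.

```python
class FakeZk:
    """In-memory stand-in for the /controller znode (not a real ZooKeeper client)."""

    def __init__(self, owner=None):
        self.controller = owner   # session id that owns /controller, or None
        self.watchers = []        # callbacks registered via subscribe()
        self.pending = []         # callbacks fired but still blocked on the lock

    def subscribe(self, callback):
        self.watchers.append(callback)

    def expire_session(self, session_id):
        # ZooKeeper removes an ephemeral node once its owning session times out.
        if self.controller == session_id:
            self.controller = None
            self.pending.extend(self.watchers)

    def create_controller(self, session_id):
        if self.controller is None:
            self.controller = session_id
            return "created"
        if self.controller == session_id:
            return "conflict-same-session"
        return "conflict-other-session"

    def deliver_pending(self):
        # Models startup releasing the lock so blocked handlers can run.
        pending, self.pending = self.pending, []
        for callback in pending:
            callback()


def run(subscribe_first):
    """Replay startup in one of the two orderings and record each elect outcome."""
    zk = FakeZk(owner="old-session")   # stale node left by the previous session
    outcomes = []

    def elect():
        outcomes.append(zk.create_controller("new-session"))

    if subscribe_first:
        zk.subscribe(elect)               # startup: subscribe to data changes
        zk.expire_session("old-session")  # zookeeper: delete stale /controller
        elect()                           # startup: elect, writes /controller
    else:
        elect()                           # startup: elect, stale node still there
        zk.expire_session("old-session")  # stale node goes away during backoff
        elect()                           # retry succeeds
        zk.subscribe(elect)               # subscribe only after the election
    zk.deliver_pending()                  # handleDataDeleted runs, if it was queued
    return outcomes


if __name__ == "__main__":
    print(run(subscribe_first=True))
    print(run(subscribe_first=False))
```

With `subscribe_first=True` the second election collides with a node written by the same session, which is the case the conflict handler does not expect. With `subscribe_first=False` no deletion callback is ever queued, so only one election ends up writing the node.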