[jira] [Commented] (KAFKA-1451) Broker stuck due to leader election race
[ https://issues.apache.org/jira/browse/KAFKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14306753#comment-14306753 ] justdebugit commented on KAFKA-1451:

I think the patch does not resolve the problem; it will happen again when the zk event notification arrives long after the /controller znode has been deleted. The kafkaController.onControllerResignation method has executed, but the elect action returns when it finds that the "/controller" znode has already been created.

> Broker stuck due to leader election race
> ----------------------------------------
>
> Key: KAFKA-1451
> URL: https://issues.apache.org/jira/browse/KAFKA-1451
> Project: Kafka
> Issue Type: Bug
> Components: core
> Affects Versions: 0.8.1.1
> Reporter: Maciek Makowski
> Assignee: Manikumar Reddy
> Priority: Minor
> Labels: newbie
> Fix For: 0.8.2
>
> Attachments: KAFKA-1451.patch, KAFKA-1451_2014-07-28_20:27:32.patch, KAFKA-1451_2014-07-29_10:13:23.patch
>
> h3. Symptoms
> The broker does not become available due to being stuck in an infinite loop while electing a leader. This can be recognised by the following line being repeatedly written to server.log:
> {code}
> [2014-05-14 04:35:09,187] INFO I wrote this conflicted ephemeral node [{"version":1,"brokerid":1,"timestamp":"1400060079108"}] at /controller a while back in a different session, hence I will backoff for this node to be deleted by Zookeeper and retry (kafka.utils.ZkUtils$)
> {code}
> h3. Steps to Reproduce
> In a single-node kafka 0.8.1.1, single-node zookeeper 3.4.6 setup (but it will likely behave the same with the ZK version included in the Kafka distribution):
> # start both zookeeper and kafka (in any order)
> # stop zookeeper
> # stop kafka
> # start kafka
> # start zookeeper
> h3. Likely Cause
> {{ZookeeperLeaderElector}} subscribes to data changes on startup, and then triggers an election. If the deletion of the ephemeral {{/controller}} node associated with the broker's previous zookeeper session happens after the subscription to changes in the new session, the election will be invoked twice, once from {{startup}} and once from {{handleDataDeleted}}:
> * {{startup}}: acquire {{controllerLock}}
> * {{startup}}: subscribe to data changes
> * zookeeper: delete {{/controller}} since the session that created it timed out
> * {{handleDataDeleted}}: {{/controller}} was deleted
> * {{handleDataDeleted}}: wait on {{controllerLock}}
> * {{startup}}: elect -- writes {{/controller}}
> * {{startup}}: release {{controllerLock}}
> * {{handleDataDeleted}}: acquire {{controllerLock}}
> * {{handleDataDeleted}}: elect -- attempts to write {{/controller}} and then gets into an infinite loop as a result of the conflict
> {{createEphemeralPathExpectConflictHandleZKBug}} assumes that the existing znode was written from a different session, which is not true in this case; it was written from the same session. That adds to the confusion.
> h3. Suggested Fix
> In {{ZookeeperLeaderElector.startup}} first run {{elect}} and then subscribe to data changes.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
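The interleaving in the Likely Cause section can be sketched as a toy model (hypothetical Python, not Kafka's actual Scala code; all names here are made up for illustration): with the original subscribe-then-elect order, a pending deletion notification replays a second election that conflicts with the first, while the suggested elect-then-subscribe order yields a single clean election.

```python
# Toy model of the ZookeeperLeaderElector startup race (hypothetical names).

class Elector:
    def __init__(self):
        self.znode_exists = False   # models the /controller ephemeral znode
        self.elect_calls = 0

    def elect(self):
        """Try to create /controller; a second create conflicts."""
        self.elect_calls += 1
        if self.znode_exists:
            return "conflict"       # the real code loops retrying here
        self.znode_exists = True
        return "won"

def buggy_startup(elector):
    # Original order: subscribe first, then elect. ZooKeeper deletes the
    # stale ephemeral node between the two steps, so handleDataDeleted
    # fires and runs a second election that conflicts with the first.
    results = []
    deletion_pending = True                 # subscription observed the delete
    results.append(elector.elect())         # election from startup
    if deletion_pending:                    # handleDataDeleted callback
        results.append(elector.elect())     # second election -> conflict
    return results

def fixed_startup(elector):
    # Suggested fix: elect first, subscribe afterwards, so the stale-znode
    # deletion notification can no longer trigger a duplicate election.
    results = [elector.elect()]             # election from startup
    # subscription happens only now; no pending deletion event is replayed
    return results

print(buggy_startup(Elector()))   # ['won', 'conflict']
print(fixed_startup(Elector()))   # ['won']
```

The model collapses locking and the ZooKeeper client into booleans, but it shows why reordering the two calls in {{ZookeeperLeaderElector.startup}} removes the duplicate election.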
[ https://issues.apache.org/jira/browse/KAFKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14346728#comment-14346728 ] Aldrin Seychell commented on KAFKA-1451:

I just encountered this issue on version 0.8.2.0 after a period of slow PC performance, when zookeeper and kafka were perhaps slow to communicate with each other (possibly the same issue highlighted by [~sinewy]). This resulted in an infinite loop attempting to create the ephemeral node. The logs that were continuously being written are as follows:

[2015-03-03 13:48:19,831] INFO conflict in /brokers/ids/0 data: {"jmx_port":-1,"timestamp":"1425386833617","host":"MTDKP119.ix.com","version":1,"port":9092} stored data: {"jmx_port":-1,"timestamp":"1425380575230","host":"MTDKP119.ix.com","version":1,"port":9092} (kafka.utils.ZkUtils$)
[2015-03-03 13:48:19,832] INFO I wrote this conflicted ephemeral node [{"jmx_port":-1,"timestamp":"1425386833617","host":"MTDKP119.ix.com","version":1,"port":9092}] at /brokers/ids/0 a while back in a different session, hence I will backoff for this node to be deleted by Zookeeper and retry (kafka.utils.ZkUtils$)
[2015-03-03 13:48:25,844] INFO conflict in /brokers/ids/0 data: {"jmx_port":-1,"timestamp":"1425386833617","host":"MTDKP119.ix.com","version":1,"port":9092} stored data: {"jmx_port":-1,"timestamp":"1425380575230","host":"MTDKP119.ix.com","version":1,"port":9092} (kafka.utils.ZkUtils$)
[ https://issues.apache.org/jira/browse/KAFKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14516532#comment-14516532 ] Marcus Aidley commented on KAFKA-1451:

I have also hit this issue on version 0.8.2.0. It occurred directly after Zookeeper was restarted:

[2015-04-27 03:47:03,291] INFO conflict in /brokers/ids/2 data: {"jmx_port":-1,"timestamp":"1430038275477","host":"ams5mdppdmsbacmq01b.markit.partners","version":1,"port":9092} stored data: {"jmx_port":-1,"timestamp":"1430036480690","host":"ams5mdppdmsbacmq01b.markit.partners","version":1,"port":9092} (kafka.utils.ZkUtils$)
[2015-04-27 03:47:03,292] INFO I wrote this conflicted ephemeral node [{"jmx_port":-1,"timestamp":"1430038275477","host":"ams5mdppdmsbacmq01b.markit.partners","version":1,"port":9092}] at /brokers/ids/2 a while back in a different session, hence I will backoff for this node to be deleted by Zookeeper and retry (kafka.utils.ZkUtils$)
[ https://issues.apache.org/jira/browse/KAFKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593515#comment-14593515 ] Raghav commented on KAFKA-1451:

Hit this issue on version 0.8.2.1 when a Twitter stream generated a large amount of data. I have one topic with two brokers and two partitions.

[2015-06-19 20:35:10,141] INFO I wrote this conflicted ephemeral node [{"jmx_port":1,"timestamp":"1434726183806","host":"localhost.localdomain","version":1,"port":9093}] at /brokers/ids/2 a while back in a different session, hence I will backoff for this node to be deleted by Zookeeper and retry (kafka.utils.ZkUtils$)
[2015-06-19 20:35:16,246] INFO conflict in /brokers/ids/2 data: {"jmx_port":1,"timestamp":"1434726183806","host":"localhost.localdomain","version":1,"port":9093} stored data: {"jmx_port":1,"timestamp":"1434726044184","host":"localhost.localdomain","version":1,"port":9093} (kafka.utils.ZkUtils$)
[2015-06-19 20:35:16,796] INFO I wrote this conflicted ephemeral node [{"jmx_port":1,"timestamp":"1434726183806","host":"localhost.localdomain","version":1,"port":9093}] at /brokers/ids/2 a while back in a different session, hence I will backoff for this node to be deleted by Zookeeper and retry (kafka.utils.ZkUtils$)

(the same conflict/backoff pair of log lines repeats every ~6 seconds through 20:35:42)
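The conflict/backoff log pattern reported throughout this thread comes from {{createEphemeralPathExpectConflictHandleZKBug}}-style retry logic: on a create conflict it assumes the stored znode was written in an earlier session, logs the two payloads, backs off, and retries, expecting ZooKeeper to expire the old ephemeral node. A minimal sketch follows (hypothetical Python with made-up helper names, not the real kafka.utils.ZkUtils code; the retry count is bounded here only for the demo, whereas the real loop retries indefinitely):

```python
import itertools
import time

def create_ephemeral_expect_conflict(zk, path, payload, retries=3, backoff_s=0.0):
    """Sketch of the conflict-handling create: on conflict, log both
    payloads, back off, and retry, hoping ZooKeeper expires the old node."""
    for attempt in itertools.count(1):
        if zk.create(path, payload):
            return True                 # created successfully
        stored = zk.read(path)
        print(f"conflict in {path} data: {payload} stored data: {stored}")
        if attempt >= retries:
            return False                # demo-only bound; real code never gives up
        time.sleep(backoff_s)           # wait for the old ephemeral node to expire

class FakeZk:
    """In-memory stand-in where the conflicting node never expires,
    modelling the same-session case from this ticket."""
    def __init__(self, existing):
        self.nodes = dict(existing)
    def create(self, path, payload):
        if path in self.nodes:
            return False
        self.nodes[path] = payload
        return True
    def read(self, path):
        return self.nodes[path]

zk = FakeZk({"/controller": '{"brokerid":1,"timestamp":"old"}'})
ok = create_ephemeral_expect_conflict(zk, "/controller", '{"brokerid":1,"timestamp":"new"}')
print(ok)   # False
```

Because a znode written by the *same* session never expires while that session is alive, the backoff never helps and the loop spins forever, which matches the endless log output reported above.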
[ https://issues.apache.org/jira/browse/KAFKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14713405#comment-14713405 ] Jason Kania commented on KAFKA-1451:

I too am seeing this issue in 0.8.2.1.
[ https://issues.apache.org/jira/browse/KAFKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14791763#comment-14791763 ] dude commented on KAFKA-1451:

Also occurred in a 3-node kafka 0.8.2.1 cluster.
[ https://issues.apache.org/jira/browse/KAFKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944726#comment-14944726 ] XiangChen commented on KAFKA-1451:

Also hit in 0.8.2.1, and the /controller node in zk is lost.
[ https://issues.apache.org/jira/browse/KAFKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945382#comment-14945382 ] Jiangjie Qin commented on KAFKA-1451:

[~laxpio] May be related to KAFKA-2437.
[ https://issues.apache.org/jira/browse/KAFKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945391#comment-14945391 ] Flavio Junqueira commented on KAFKA-1451:

Maybe related to KAFKA-1387?
[ https://issues.apache.org/jira/browse/KAFKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15013883#comment-15013883 ] Zach Cox commented on KAFKA-1451:

We experienced this yesterday on a 3-node 0.8.2.1 cluster, which caused a major outage for several hours. Restarting Kafka brokers several times, along with restarting Zookeeper nodes, did not resolve the issue. We identified one of the brokers that seemed to be going in/out of ISRs repeatedly, and ended up deleting all of its state on disk & restarting it. This was the only thing that finally resolved the issue. Maybe there was some corrupt state on that broker's disk? We still have that broker's state (moved its data dir, didn't actually delete) if that is helpful at all.
[jira] [Commented] (KAFKA-1451) Broker stuck due to leader election race
[ https://issues.apache.org/jira/browse/KAFKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15013983#comment-15013983 ] Flavio Junqueira commented on KAFKA-1451: - [~zcox] if you observed messages like the ones in this comment above https://issues.apache.org/jira/browse/KAFKA-1451?focusedCommentId=14593515&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14593515 then I suspect this will be resolved with the fix of KAFKA-1387, which will be available in 0.9.
[jira] [Commented] (KAFKA-1451) Broker stuck due to leader election race
[ https://issues.apache.org/jira/browse/KAFKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15014006#comment-15014006 ] Zach Cox commented on KAFKA-1451: - [~fpj] Yes we saw the "I wrote this conflicted ephemeral node" error messages, we saw lots of partitions in/out of ISRs and a lot of this too: {code} [2015-11-19 01:05:51,685] INFO Opening socket connection to server ip-10-10-1-35.ec2.internal/10.10.1.35:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn) [2015-11-19 01:05:51,685] INFO Socket connection established to ip-10-10-1-35.ec2.internal/10.10.1.35:2181, initiating session (org.apache.zookeeper.ClientCnxn) [2015-11-19 01:05:51,687] INFO Unable to reconnect to ZooKeeper service, session 0x54a0e5799a8195d has expired, closing socket connection (org.apache.zookeeper.ClientCnxn) [2015-11-19 01:05:51,687] INFO zookeeper state changed (Expired) (org.I0Itec.zkclient.ZkClient) [2015-11-19 01:05:51,687] INFO Initiating client connection, connectString=zookeeper1.production.redacted.com:2181,zookeeper2.production.redacted.com:2181,zookeeper3.production.redacted.com:2181/kafka sessionTimeout=6000 watcher=org.I0Itec.zkclient.ZkClient@ace1333 (org.apache.zookeeper.ZooKeeper) [2015-11-19 01:05:51,701] INFO EventThread shut down (org.apache.zookeeper.ClientCnxn) [2015-11-19 01:05:51,701] ERROR Error handling event ZkEvent[New session event sent to kafka.controller.KafkaController$SessionExpirationListener@2261adb8] (org.I0Itec.zkclient.ZkEventThread) java.lang.IllegalStateException: Kafka scheduler has not been started at kafka.utils.KafkaScheduler.ensureStarted(KafkaScheduler.scala:114) at kafka.utils.KafkaScheduler.shutdown(KafkaScheduler.scala:86) at kafka.controller.KafkaController.onControllerResignation(KafkaController.scala:350) at kafka.controller.KafkaController$SessionExpirationListener$$anonfun$handleNewSession$1.apply$mcZ$sp(KafkaController.scala:1108) at 
kafka.controller.KafkaController$SessionExpirationListener$$anonfun$handleNewSession$1.apply(KafkaController.scala:1107) at kafka.controller.KafkaController$SessionExpirationListener$$anonfun$handleNewSession$1.apply(KafkaController.scala:1107) at kafka.utils.Utils$.inLock(Utils.scala:535) at kafka.controller.KafkaController$SessionExpirationListener.handleNewSession(KafkaController.scala:1107) at org.I0Itec.zkclient.ZkClient$4.run(ZkClient.java:472) at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71) [2015-11-19 01:05:51,701] INFO re-registering broker info in ZK for broker 3 (kafka.server.KafkaHealthcheck) [2015-11-19 01:05:51,701] INFO Opening socket connection to server ip-10-10-1-104.ec2.internal/10.10.1.104:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn) [2015-11-19 01:05:51,702] INFO Socket connection established to ip-10-10-1-104.ec2.internal/10.10.1.104:2181, initiating session (org.apache.zookeeper.ClientCnxn) [2015-11-19 01:05:51,713] INFO Session establishment complete on server ip-10-10-1-104.ec2.internal/10.10.1.104:2181, sessionid = 0x64a0e57972a1a85, negotiated timeout = 6000 (org.apache.zookeeper.ClientCnxn) [2015-11-19 01:05:51,713] INFO zookeeper state changed (SyncConnected) (org.I0Itec.zkclient.ZkClient) [2015-11-19 01:05:51,718] INFO Registered broker 3 at path /brokers/ids/3 with address mesos-slave3.production.redacted.com:9092. 
(kafka.utils.ZkUtils$) [2015-11-19 01:05:51,718] INFO done re-registering broker (kafka.server.KafkaHealthcheck) [2015-11-19 01:05:51,718] INFO Subscribing to /brokers/topics path to watch for new topics (kafka.server.KafkaHealthcheck) [2015-11-19 01:05:51,721] INFO New leader is 1 (kafka.server.ZookeeperLeaderElector$LeaderChangeListener) {code}
[jira] [Commented] (KAFKA-1451) Broker stuck due to leader election race
[ https://issues.apache.org/jira/browse/KAFKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14064976#comment-14064976 ] Kenny commented on KAFKA-1451: -- This can also be caused by restarting Kafka quickly after a SIGKILL. I had a supervisord config file with 'stopwaitsecs=1', and it would pretty reliably create a hung Kafka process.
[jira] [Commented] (KAFKA-1451) Broker stuck due to leader election race
[ https://issues.apache.org/jira/browse/KAFKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14065055#comment-14065055 ] Jun Rao commented on KAFKA-1451: Thanks for reporting this. Very interesting. That does sound like a potential problem. The problem is that ZookeeperLeaderElector.elect assumes that no controller exists. However, this may not be true. One possible solution is to first check the existence of the controller from ZK before creating the ephemeral node.
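Jun's check-before-create idea can be sketched like this (illustrative Python stand-ins, not Kafka's actual Scala code; the znode is modelled as a dict slot):

```python
import threading

# Hypothetical stand-in for ZooKeeper state; only the existence check matters.
znodes = {"/controller": None}
controller_lock = threading.Lock()

def elect(broker_id):
    """Check for an existing controller before creating the ephemeral node,
    so a second invocation (e.g. from handleDataDeleted) is a no-op instead
    of an endless conflict-and-backoff loop."""
    with controller_lock:
        if znodes["/controller"] is not None:
            return False  # a controller (possibly this broker) already exists
        znodes["/controller"] = broker_id
        return True

won_startup = elect(1)   # first call wins and writes the node
won_callback = elect(1)  # delayed handleDataDeleted call finds it and yields
```

The point is only that the duplicate invocation terminates cleanly; it does not by itself address the window Neha raises in the next comment.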
[jira] [Commented] (KAFKA-1451) Broker stuck due to leader election race
[ https://issues.apache.org/jira/browse/KAFKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14066456#comment-14066456 ] Neha Narkhede commented on KAFKA-1451: -- Just checking the existence is not enough, since there is a risk of not electing a controller at all if all brokers do the same and the node disappears. The following will work: 1. Register the watch 2. Check existence and elect if one does not exist #1 ensures that if the node disappears, an election will take place
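Neha's ordering can be sketched as follows (illustrative Python; {{ToyZk}}, {{subscribe_deletion}} and {{delete_controller}} are invented stand-ins for ZkClient's watch registration and ZooKeeper expiring the ephemeral znode):

```python
# Registering the watch (step 1) before the existence check (step 2) means a
# deletion that happens after the check can never go unnoticed.
class ToyZk:
    def __init__(self):
        self.controller = None
        self.deletion_watchers = []

    def subscribe_deletion(self, callback):
        self.deletion_watchers.append(callback)

    def delete_controller(self):
        # Models ZooKeeper expiring the ephemeral /controller znode.
        self.controller = None
        for cb in list(self.deletion_watchers):
            cb()

def elect(zk, broker_id):
    if zk.controller is None:  # check existence, elect only if absent
        zk.controller = broker_id

def startup(zk, broker_id):
    zk.subscribe_deletion(lambda: elect(zk, broker_id))  # 1. register watch
    elect(zk, broker_id)                                 # 2. check and elect

zk = ToyZk()
startup(zk, broker_id=1)
# If the controller later disappears, the watch re-triggers an election:
zk.delete_controller()
```

After the deletion fires, the watcher re-runs the election, so the cluster is never left without a controller even if every broker skipped the initial create.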
[jira] [Commented] (KAFKA-1451) Broker stuck due to leader election race
[ https://issues.apache.org/jira/browse/KAFKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068193#comment-14068193 ] Jun Rao commented on KAFKA-1451: Neha, I am not sure if #1 is needed. We can get into elect from two paths: (1) from startup or (2) from handleDeleted. If it's from startup, we already register the watcher before calling elect. If it's from handleDeleted, the watcher must have already been registered. So, once in elect, we know the watcher is already registered. Therefore, if the controller node goes away immediately after we check its existence, the watcher is guaranteed to be triggered.
[jira] [Commented] (KAFKA-1451) Broker stuck due to leader election race
[ https://issues.apache.org/jira/browse/KAFKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075317#comment-14075317 ] Manikumar Reddy commented on KAFKA-1451: Created reviewboard https://reviews.apache.org/r/23962/diff/ against branch origin/trunk
[jira] [Commented] (KAFKA-1451) Broker stuck due to leader election race
[ https://issues.apache.org/jira/browse/KAFKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075321#comment-14075321 ] Manikumar Reddy commented on KAFKA-1451: Uploaded a patch which checks controller existence in the leader election process. With this patch, I am not able to reproduce the issue.
[jira] [Commented] (KAFKA-1451) Broker stuck due to leader election race
[ https://issues.apache.org/jira/browse/KAFKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076270#comment-14076270 ] Manikumar Reddy commented on KAFKA-1451: Updated reviewboard https://reviews.apache.org/r/23962/diff/ against branch origin/trunk
[jira] [Commented] (KAFKA-1451) Broker stuck due to leader election race
[ https://issues.apache.org/jira/browse/KAFKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076274#comment-14076274 ] Manikumar Reddy commented on KAFKA-1451: Created reviewboard https://reviews.apache.org/r/23983/diff/ against branch origin/trunk
[jira] [Commented] (KAFKA-1451) Broker stuck due to leader election race
[ https://issues.apache.org/jira/browse/KAFKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076277#comment-14076277 ] Manikumar Reddy commented on KAFKA-1451: Updated reviewboard https://reviews.apache.org/r/23962/diff/ against branch origin/trunk
if the deletion of ephemeral {{/controller}} node > associated with previous zookeeper session of the broker happens after > subscription to changes in new session, election will be invoked twice, once > from {{startup}} and once from {{handleDataDeleted}}: > * {{startup}}: acquire {{controllerLock}} > * {{startup}}: subscribe to data changes > * zookeeper: delete {{/controller}} since the session that created it timed > out > * {{handleDataDeleted}}: {{/controller}} was deleted > * {{handleDataDeleted}}: wait on {{controllerLock}} > * {{startup}}: elect -- writes {{/controller}} > * {{startup}}: release {{controllerLock}} > * {{handleDataDeleted}}: acquire {{controllerLock}} > * {{handleDataDeleted}}: elect -- attempts to write {{/controller}} and then > gets into infinite loop as a result of conflict > {{createEphemeralPathExpectConflictHandleZKBug}} assumes that the existing > znode was written from different session, which is not true in this case; it > was written from the same session. That adds to the confusion. > h3. Suggested Fix > In {{ZookeeperLeaderElector.startup}} first run {{elect}} and then subscribe > to data changes. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (KAFKA-1451) Broker stuck due to leader election race
[ https://issues.apache.org/jira/browse/KAFKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077385#comment-14077385 ] Manikumar Reddy commented on KAFKA-1451: Updated reviewboard https://reviews.apache.org/r/23962/diff/ against branch origin/trunk
[jira] [Commented] (KAFKA-1451) Broker stuck due to leader election race
[ https://issues.apache.org/jira/browse/KAFKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092051#comment-14092051 ] Joe Stein commented on KAFKA-1451: Hi, two issues with leader election have been found so far: https://issues.apache.org/jira/browse/KAFKA-1387?focusedCommentId=14087063&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14087063 I don't know yet whether the issues are related to each other, or even to this one. The issues were not happening on the 0.8.1 branch, so it could be another 0.8.2 patch, I suppose. But before I start testing on a 0.8.2 build without this patch (to isolate the root cause), I wanted to ask whether this type of scenario was tested, what the general thoughts are on this patch, and how it might be affecting either of the two issues found in 0.8.2 trunk.
[jira] [Commented] (KAFKA-1451) Broker stuck due to leader election race
[ https://issues.apache.org/jira/browse/KAFKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092253#comment-14092253 ] Jun Rao commented on KAFKA-1451: Joe, KAFKA-1387 seems to be related to broker registration, and this jira only fixes how the controller is registered in ZK. So I am not sure they are related.