[jira] [Commented] (KAFKA-1451) Broker stuck due to leader election race

2015-02-04 Thread justdebugit (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14306753#comment-14306753
 ] 

justdebugit commented on KAFKA-1451:


I think the patch does not resolve the problem. It will happen again when the 
ZK event notification arrives long after /controller has been deleted: the 
kafkaController.onControllerResignation method has executed, but the elect 
action returns when it finds that the "/controller" znode has already been 
created.

> Broker stuck due to leader election race 
> -
>
> Key: KAFKA-1451
> URL: https://issues.apache.org/jira/browse/KAFKA-1451
> Project: Kafka
> Issue Type: Bug
> Components: core
> Affects Versions: 0.8.1.1
> Reporter: Maciek Makowski
> Assignee: Manikumar Reddy
> Priority: Minor
> Labels: newbie
> Fix For: 0.8.2
>
> Attachments: KAFKA-1451.patch, KAFKA-1451_2014-07-28_20:27:32.patch, 
> KAFKA-1451_2014-07-29_10:13:23.patch
>
>
> h3. Symptoms
> The broker does not become available because it is stuck in an infinite loop 
> while electing a leader. This can be recognised by the following line being 
> repeatedly written to server.log:
> {code}
> [2014-05-14 04:35:09,187] INFO I wrote this conflicted ephemeral node 
> [{"version":1,"brokerid":1,"timestamp":"1400060079108"}] at /controller a 
> while back in a different session, hence I will backoff for this node to be 
> deleted by Zookeeper and retry (kafka.utils.ZkUtils$)
> {code}
> h3. Steps to Reproduce
> In a single kafka 0.8.1.1 node, single zookeeper 3.4.6 (but will likely 
> behave the same with the ZK version included in Kafka distribution) node 
> setup:
> # start both zookeeper and kafka (in any order)
> # stop zookeeper
> # stop kafka
> # start kafka
> # start zookeeper
> h3. Likely Cause
> {{ZookeeperLeaderElector}} subscribes to data changes on startup, and then 
> triggers an election. If the deletion of the ephemeral {{/controller}} node 
> associated with the broker's previous zookeeper session happens after the 
> subscription to changes in the new session, election will be invoked twice, 
> once from {{startup}} and once from {{handleDataDeleted}}:
> * {{startup}}: acquire {{controllerLock}}
> * {{startup}}: subscribe to data changes
> * zookeeper: delete {{/controller}} since the session that created it timed 
> out
> * {{handleDataDeleted}}: {{/controller}} was deleted
> * {{handleDataDeleted}}: wait on {{controllerLock}}
> * {{startup}}: elect -- writes {{/controller}}
> * {{startup}}: release {{controllerLock}}
> * {{handleDataDeleted}}: acquire {{controllerLock}}
> * {{handleDataDeleted}}: elect -- attempts to write {{/controller}} and then 
> gets into an infinite loop as a result of the conflict
>
> {{createEphemeralPathExpectConflictHandleZKBug}} assumes that the existing 
> znode was written from a different session, which is not true in this case; 
> it was written from the same session. That adds to the confusion.
> h3. Suggested Fix
> In {{ZookeeperLeaderElector.startup}} first run {{elect}} and then subscribe 
> to data changes.
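
The sequence described under "Likely Cause" can be modelled with a small 
in-memory sketch. This is not Kafka or ZooKeeper code: FakeZk, elect, run, the 
guard flag and the max_retries bound are all hypothetical names made up for 
illustration, and comparing payloads only stands in for the real conflict 
check. It shows a second election in the same session getting stuck on its own 
znode, and how a check of the kind discussed in this thread (return early when 
/controller already exists) would avoid that.

```python
class NodeExists(Exception):
    """Stand-in for ZooKeeper's 'node already exists' error."""


class FakeZk:
    """Tiny in-memory model of a znode store (hypothetical, not a real client)."""

    def __init__(self):
        self.nodes = {}  # path -> (data, owning session id)

    def create_ephemeral(self, path, data, session):
        if path in self.nodes:
            raise NodeExists(path)
        self.nodes[path] = (data, session)

    def read(self, path):
        entry = self.nodes.get(path)
        return entry[0] if entry else None


def elect(zk, broker_id, session, guard=False, max_retries=3):
    """Model of one election attempt; returns (outcome, retries)."""
    data = {"brokerid": broker_id}
    if guard and zk.read("/controller") is not None:
        # A controller is already registered, so skip the write entirely.
        return ("skipped", 0)
    retries = 0
    while True:
        try:
            zk.create_ephemeral("/controller", data, session)
            return ("elected", retries)
        except NodeExists:
            if zk.read("/controller") != data:
                return ("lost", retries)  # another broker is the controller
            # Same payload: assumed to be a stale node from an old session,
            # so back off and retry. The loop in the report has no bound;
            # max_retries exists only so that this sketch terminates.
            if retries == max_retries:
                return ("stuck", retries)
            retries += 1


def run(guard):
    """Invoke elect twice in one session, as startup and handleDataDeleted do."""
    zk = FakeZk()
    first = elect(zk, broker_id=1, session="s2", guard=guard)
    second = elect(zk, broker_id=1, session="s2", guard=guard)
    return first, second
```

Calling run(guard=False) reproduces the stuck second election in this model, 
while run(guard=True) lets the second call return without writing.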



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-1451) Broker stuck due to leader election race

2015-03-04 Thread Aldrin Seychell (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14346728#comment-14346728
 ] 

Aldrin Seychell commented on KAFKA-1451:


I just encountered this issue on version 0.8.2.0 after a period of slow PC 
performance, during which zookeeper and kafka were perhaps slow to communicate 
with each other (possibly the same issue highlighted by [~sinewy]). This 
resulted in an infinite loop while attempting to create the ephemeral node. The 
logs that were continuously being written are as follows:

[2015-03-03 13:48:19,831] INFO conflict in /brokers/ids/0 data: 
{"jmx_port":-1,"timestamp":"1425386833617","host":"MTDKP119.ix.com","version":1,"port":9092}
 stored data: 
{"jmx_port":-1,"timestamp":"1425380575230","host":"MTDKP119.ix.com","version":1,"port":9092}
 (kafka.utils.ZkUtils$)

[2015-03-03 13:48:19,832] INFO I wrote this conflicted ephemeral node 
[{"jmx_port":-1,"timestamp":"1425386833617","host":"MTDKP119.ix.com","version":1,"port":9092}]
 at /brokers/ids/0 a while back in a different session, hence I will backoff 
for this node to be deleted by Zookeeper and retry (kafka.utils.ZkUtils$)

[2015-03-03 13:48:25,844] INFO conflict in /brokers/ids/0 data: 
{"jmx_port":-1,"timestamp":"1425386833617","host":"MTDKP119.ix.com","version":1,"port":9092}
 stored data: 
{"jmx_port":-1,"timestamp":"1425380575230","host":"MTDKP119.ix.com","version":1,"port":9092}
 (kafka.utils.ZkUtils$)
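
A repeating back-off message like the one above can be counted mechanically. 
The following is only a sketch: count_conflict_retries, CONFLICT_RE and SAMPLE 
are hypothetical names, and the pattern is derived solely from the message text 
quoted in this thread. A count for one path that keeps growing would suggest 
the retry loop described in this issue.

```python
import re
from collections import Counter

# Matches the back-off message even when it has been wrapped across lines,
# and captures the znode path that follows "at".
CONFLICT_RE = re.compile(
    r"I wrote this conflicted ephemeral node\s+\[.*?\]\s+at\s+(\S+)"
    r"\s+a\s+while\s+back",
    re.DOTALL,
)


def count_conflict_retries(log_text):
    """Return a Counter mapping znode path -> number of back-off messages."""
    return Counter(CONFLICT_RE.findall(log_text))


# One occurrence of the message, copied from the log excerpt above.
SAMPLE = """\
[2015-03-03 13:48:19,832] INFO I wrote this conflicted ephemeral node
[{"jmx_port":-1,"timestamp":"1425386833617","host":"MTDKP119.ix.com","version":1,"port":9092}]
 at /brokers/ids/0 a while back in a different session, hence I will backoff
for this node to be deleted by Zookeeper and retry (kafka.utils.ZkUtils$)
"""

for path, count in count_conflict_retries(SAMPLE).items():
    print(path, count)
```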



[jira] [Commented] (KAFKA-1451) Broker stuck due to leader election race

2015-04-27 Thread Marcus Aidley (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14516532#comment-14516532
 ] 

Marcus Aidley commented on KAFKA-1451:
--

I have also hit this issue on version 0.8.2.0. It occurred directly after 
Zookeeper got restarted:

[2015-04-27 03:47:03,291] INFO conflict in /brokers/ids/2 data: 
{"jmx_port":-1,"timestamp":"1430038275477","host":"ams5mdppdmsbacmq01b.markit.partners","version":1,"port":9092}
 stored data: 
{"jmx_port":-1,"timestamp":"1430036480690","host":"ams5mdppdmsbacmq01b.markit.partners","version":1,"port":9092}
 (kafka.utils.ZkUtils$)
[2015-04-27 03:47:03,292] INFO I wrote this conflicted ephemeral node 
[{"jmx_port":-1,"timestamp":"1430038275477","host":"ams5mdppdmsbacmq01b.markit.partners","version":1,"port":9092}]
 at /brokers/ids/2 a while back in a different session, hence I will backoff 
for this node to be deleted by Zookeeper and retry (kafka.utils.ZkUtils$)
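
The two payloads in the "conflict" line differ only in their timestamp field. 
Assuming those fields are epoch milliseconds (an assumption, not something 
stated in this thread), a short sketch can show how far apart the two 
registrations are. registration_gap_ms, NEW and STORED are hypothetical names; 
the payloads are copied from the log lines above.

```python
import json


def registration_gap_ms(new_data, stored_data):
    """Milliseconds between the 'timestamp' fields of two JSON payloads."""
    new_ts = int(json.loads(new_data)["timestamp"])
    stored_ts = int(json.loads(stored_data)["timestamp"])
    return new_ts - stored_ts


# "data" is what the broker tried to write; "stored data" is what was found.
NEW = (
    '{"jmx_port":-1,"timestamp":"1430038275477",'
    '"host":"ams5mdppdmsbacmq01b.markit.partners","version":1,"port":9092}'
)
STORED = (
    '{"jmx_port":-1,"timestamp":"1430036480690",'
    '"host":"ams5mdppdmsbacmq01b.markit.partners","version":1,"port":9092}'
)

gap = registration_gap_ms(NEW, STORED)
print(f"stored payload is {gap / 60000:.1f} minutes older")
```

For these two payloads the difference is 1,794,787 ms, a little under half an 
hour.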



[jira] [Commented] (KAFKA-1451) Broker stuck due to leader election race

2015-06-19 Thread Raghav (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593515#comment-14593515
 ] 

Raghav commented on KAFKA-1451:
---

Hit this issue on version 0.8.2.1 while a Twitter stream was generating a large 
volume of data. I have one topic with two brokers and two partitions.


[2015-06-19 20:35:10,141] INFO I wrote this conflicted ephemeral node 
[{"jmx_port":1,"timestamp":"1434726183806","host":"localhost.localdomain","version":1,"port":9093}]
 at /brokers/ids/2 a while back in a different session, hence I will backoff 
for this node to be deleted by Zookeeper and retry (kafka.utils.ZkUtils$)
[2015-06-19 20:35:16,246] INFO conflict in /brokers/ids/2 data: 
{"jmx_port":1,"timestamp":"1434726183806","host":"localhost.localdomain","version":1,"port":9093}
 stored data: 
{"jmx_port":1,"timestamp":"1434726044184","host":"localhost.localdomain","version":1,"port":9093}
 (kafka.utils.ZkUtils$)
[2015-06-19 20:35:16,796] INFO I wrote this conflicted ephemeral node 
[{"jmx_port":1,"timestamp":"1434726183806","host":"localhost.localdomain","version":1,"port":9093}]
 at /brokers/ids/2 a while back in a different session, hence I will backoff 
for this node to be deleted by Zookeeper and retry (kafka.utils.ZkUtils$)
[2015-06-19 20:35:22,965] INFO conflict in /brokers/ids/2 data: 
{"jmx_port":1,"timestamp":"1434726183806","host":"localhost.localdomain","version":1,"port":9093}
 stored data: 
{"jmx_port":1,"timestamp":"1434726044184","host":"localhost.localdomain","version":1,"port":9093}
 (kafka.utils.ZkUtils$)
[2015-06-19 20:35:22,967] INFO I wrote this conflicted ephemeral node 
[{"jmx_port":1,"timestamp":"1434726183806","host":"localhost.localdomain","version":1,"port":9093}]
 at /brokers/ids/2 a while back in a different session, hence I will backoff 
for this node to be deleted by Zookeeper and retry (kafka.utils.ZkUtils$)
[2015-06-19 20:35:29,159] INFO conflict in /brokers/ids/2 data: 
{"jmx_port":1,"timestamp":"1434726183806","host":"localhost.localdomain","version":1,"port":9093}
 stored data: 
{"jmx_port":1,"timestamp":"1434726044184","host":"localhost.localdomain","version":1,"port":9093}
 (kafka.utils.ZkUtils$)
[2015-06-19 20:35:29,161] INFO I wrote this conflicted ephemeral node 
[{"jmx_port":1,"timestamp":"1434726183806","host":"localhost.localdomain","version":1,"port":9093}]
 at /brokers/ids/2 a while back in a different session, hence I will backoff 
for this node to be deleted by Zookeeper and retry (kafka.utils.ZkUtils$)
[2015-06-19 20:35:35,219] INFO conflict in /brokers/ids/2 data: 
{"jmx_port":1,"timestamp":"1434726183806","host":"localhost.localdomain","version":1,"port":9093}
 stored data: 
{"jmx_port":1,"timestamp":"1434726044184","host":"localhost.localdomain","version":1,"port":9093}
 (kafka.utils.ZkUtils$)
[2015-06-19 20:35:35,221] INFO I wrote this conflicted ephemeral node 
[{"jmx_port":1,"timestamp":"1434726183806","host":"localhost.localdomain","version":1,"port":9093}]
 at /brokers/ids/2 a while back in a different session, hence I will backoff 
for this node to be deleted by Zookeeper and retry (kafka.utils.ZkUtils$)
[2015-06-19 20:35:41,338] INFO conflict in /brokers/ids/2 data: 
{"jmx_port":1,"timestamp":"1434726183806","host":"localhost.localdomain","version":1,"port":9093}
 stored data: 
{"jmx_port":1,"timestamp":"1434726044184","host":"localhost.localdomain","version":1,"port":9093}
 (kafka.utils.ZkUtils$)
[2015-06-19 20:35:42,208] INFO I wrote this conflicted ephemeral node 
[{"jmx_port":1,"timestamp":"1434726183806","host":"localhost.localdomain","version":1,"port":9093}]
 at /brokers/ids/2 a while back in a different session, hence I will backoff 
for this node to be deleted by Zookeeper and retry (kafka.utils.ZkUtils$)



[jira] [Commented] (KAFKA-1451) Broker stuck due to leader election race

2015-08-26 Thread Jason Kania (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14713405#comment-14713405
 ] 

Jason Kania commented on KAFKA-1451:


I too am seeing this issue in 0.8.2.1.



[jira] [Commented] (KAFKA-1451) Broker stuck due to leader election race

2015-09-17 Thread dude (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14791763#comment-14791763
 ] 

dude commented on KAFKA-1451:
-

Also occurred in a 3-node Kafka 0.8.2.1 cluster.



[jira] [Commented] (KAFKA-1451) Broker stuck due to leader election race

2015-10-06 Thread XiangChen (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944726#comment-14944726
 ] 

XiangChen commented on KAFKA-1451:
--

Also hit in 0.8.2.1, and the /controller node in ZK is lost.



[jira] [Commented] (KAFKA-1451) Broker stuck due to leader election race

2015-10-06 Thread Jiangjie Qin (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945382#comment-14945382
 ] 

Jiangjie Qin commented on KAFKA-1451:
-

[~laxpio] May be related to KAFKA-2437.



[jira] [Commented] (KAFKA-1451) Broker stuck due to leader election race

2015-10-06 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945391#comment-14945391
 ] 

Flavio Junqueira commented on KAFKA-1451:
-

Maybe related to KAFKA-1387?



[jira] [Commented] (KAFKA-1451) Broker stuck due to leader election race

2015-11-19 Thread Zach Cox (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15013883#comment-15013883
 ] 

Zach Cox commented on KAFKA-1451:
-

We experienced this yesterday on a 3-node 0.8.2.1 cluster, which caused a major 
outage for several hours. Restarting Kafka brokers several times, along with 
restarting Zookeeper nodes, did not resolve the issue. We identified one of the 
brokers that seemed to be going in/out of ISRs repeatedly, and ended up 
deleting all of its state on disk & restarting it. This was the only thing that 
finally resolved the issue. Maybe there was some corrupt state on that broker's 
disk? We still have that broker's state (moved its data dir, didn't actually 
delete) if that is helpful at all.

> Broker stuck due to leader election race 
> -
>
> Key: KAFKA-1451
> URL: https://issues.apache.org/jira/browse/KAFKA-1451
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.8.1.1
>Reporter: Maciek Makowski
>Assignee: Manikumar Reddy
>Priority: Minor
>  Labels: newbie
> Fix For: 0.8.2.0
>
> Attachments: KAFKA-1451.patch, KAFKA-1451_2014-07-28_20:27:32.patch, 
> KAFKA-1451_2014-07-29_10:13:23.patch
>
>


[jira] [Commented] (KAFKA-1451) Broker stuck due to leader election race

2015-11-19 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15013983#comment-15013983
 ] 

Flavio Junqueira commented on KAFKA-1451:
-

[~zcox] if you observed messages like the ones in this comment above

https://issues.apache.org/jira/browse/KAFKA-1451?focusedCommentId=14593515&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14593515

then I suspect this will be resolved with the fix of KAFKA-1387, which will be 
available in 0.9.

> Broker stuck due to leader election race 
> -
>
> Key: KAFKA-1451
> URL: https://issues.apache.org/jira/browse/KAFKA-1451
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.8.1.1
>Reporter: Maciek Makowski
>Assignee: Manikumar Reddy
>Priority: Minor
>  Labels: newbie
> Fix For: 0.8.2.0
>
> Attachments: KAFKA-1451.patch, KAFKA-1451_2014-07-28_20:27:32.patch, 
> KAFKA-1451_2014-07-29_10:13:23.patch
>
>


[jira] [Commented] (KAFKA-1451) Broker stuck due to leader election race

2015-11-19 Thread Zach Cox (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15014006#comment-15014006
 ] 

Zach Cox commented on KAFKA-1451:
-

[~fpj] Yes, we saw the "I wrote this conflicted ephemeral node" error messages. 
We also saw lots of partitions going in/out of ISRs, and a lot of this too:

{code}
[2015-11-19 01:05:51,685] INFO Opening socket connection to server ip-10-10-1-35.ec2.internal/10.10.1.35:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2015-11-19 01:05:51,685] INFO Socket connection established to ip-10-10-1-35.ec2.internal/10.10.1.35:2181, initiating session (org.apache.zookeeper.ClientCnxn)
[2015-11-19 01:05:51,687] INFO Unable to reconnect to ZooKeeper service, session 0x54a0e5799a8195d has expired, closing socket connection (org.apache.zookeeper.ClientCnxn)
[2015-11-19 01:05:51,687] INFO zookeeper state changed (Expired) (org.I0Itec.zkclient.ZkClient)
[2015-11-19 01:05:51,687] INFO Initiating client connection, connectString=zookeeper1.production.redacted.com:2181,zookeeper2.production.redacted.com:2181,zookeeper3.production.redacted.com:2181/kafka sessionTimeout=6000 watcher=org.I0Itec.zkclient.ZkClient@ace1333 (org.apache.zookeeper.ZooKeeper)
[2015-11-19 01:05:51,701] INFO EventThread shut down (org.apache.zookeeper.ClientCnxn)
[2015-11-19 01:05:51,701] ERROR Error handling event ZkEvent[New session event sent to kafka.controller.KafkaController$SessionExpirationListener@2261adb8] (org.I0Itec.zkclient.ZkEventThread)
java.lang.IllegalStateException: Kafka scheduler has not been started
  at kafka.utils.KafkaScheduler.ensureStarted(KafkaScheduler.scala:114)
  at kafka.utils.KafkaScheduler.shutdown(KafkaScheduler.scala:86)
  at kafka.controller.KafkaController.onControllerResignation(KafkaController.scala:350)
  at kafka.controller.KafkaController$SessionExpirationListener$$anonfun$handleNewSession$1.apply$mcZ$sp(KafkaController.scala:1108)
  at kafka.controller.KafkaController$SessionExpirationListener$$anonfun$handleNewSession$1.apply(KafkaController.scala:1107)
  at kafka.controller.KafkaController$SessionExpirationListener$$anonfun$handleNewSession$1.apply(KafkaController.scala:1107)
  at kafka.utils.Utils$.inLock(Utils.scala:535)
  at kafka.controller.KafkaController$SessionExpirationListener.handleNewSession(KafkaController.scala:1107)
  at org.I0Itec.zkclient.ZkClient$4.run(ZkClient.java:472)
  at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)
[2015-11-19 01:05:51,701] INFO re-registering broker info in ZK for broker 3 (kafka.server.KafkaHealthcheck)
[2015-11-19 01:05:51,701] INFO Opening socket connection to server ip-10-10-1-104.ec2.internal/10.10.1.104:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2015-11-19 01:05:51,702] INFO Socket connection established to ip-10-10-1-104.ec2.internal/10.10.1.104:2181, initiating session (org.apache.zookeeper.ClientCnxn)
[2015-11-19 01:05:51,713] INFO Session establishment complete on server ip-10-10-1-104.ec2.internal/10.10.1.104:2181, sessionid = 0x64a0e57972a1a85, negotiated timeout = 6000 (org.apache.zookeeper.ClientCnxn)
[2015-11-19 01:05:51,713] INFO zookeeper state changed (SyncConnected) (org.I0Itec.zkclient.ZkClient)
[2015-11-19 01:05:51,718] INFO Registered broker 3 at path /brokers/ids/3 with address mesos-slave3.production.redacted.com:9092. (kafka.utils.ZkUtils$)
[2015-11-19 01:05:51,718] INFO done re-registering broker (kafka.server.KafkaHealthcheck)
[2015-11-19 01:05:51,718] INFO Subscribing to /brokers/topics path to watch for new topics (kafka.server.KafkaHealthcheck)
[2015-11-19 01:05:51,721] INFO New leader is 1 (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
{code}

> Broker stuck due to leader election race 
> -
>
> Key: KAFKA-1451
> URL: https://issues.apache.org/jira/browse/KAFKA-1451
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.8.1.1
>Reporter: Maciek Makowski
>Assignee: Manikumar Reddy
>Priority: Minor
>  Labels: newbie
> Fix For: 0.8.2.0
>
> Attachments: KAFKA-1451.patch, KAFKA-1451_2014-07-28_20:27:32.patch, 
> KAFKA-1451_2014-07-29_10:13:23.patch
>
>

[jira] [Commented] (KAFKA-1451) Broker stuck due to leader election race

2014-07-17 Thread Kenny (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14064976#comment-14064976
 ] 

Kenny commented on KAFKA-1451:
--

This can also be caused by restarting Kafka quickly after a SIGKILL. I had a 
supervisord config file with 'stopwaitsecs=1' and it would pretty reliably 
create a hung Kafka process.

> Broker stuck due to leader election race 
> -
>
> Key: KAFKA-1451
> URL: https://issues.apache.org/jira/browse/KAFKA-1451
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.8.1.1
>Reporter: Maciek Makowski
>Priority: Minor
>



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (KAFKA-1451) Broker stuck due to leader election race

2014-07-17 Thread Jun Rao (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14065055#comment-14065055
 ] 

Jun Rao commented on KAFKA-1451:


Thanks for reporting this. Very interesting. That does sound like a potential 
problem. The problem is that ZookeeperLeaderElector.elect assumes that no 
controller exists. However, this may not be true. One possible solution is to 
first check the existence of the controller from ZK before creating the 
ephemeral node. 
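
A minimal sketch of the existence check suggested above, written in Python 
against an in-memory stand-in for ZooKeeper. FakeZk, NodeExistsError and this 
elect function are hypothetical names used for illustration only; this is not 
the Kafka code and not any attached patch.

```python
class NodeExistsError(Exception):
    """Raised when a znode already exists at the requested path."""


class FakeZk:
    """In-memory stand-in for a ZooKeeper client (illustration only)."""

    def __init__(self):
        self.nodes = {}  # path -> data

    def exists(self, path):
        return path in self.nodes

    def read(self, path):
        return self.nodes[path]

    def create_ephemeral(self, path, data):
        if path in self.nodes:
            raise NodeExistsError(path)
        self.nodes[path] = data


CONTROLLER_PATH = "/controller"


def elect(zk, broker_id):
    """Try to become controller; return the id of whoever holds the role."""
    # Check first: if a controller is already registered, do not attempt
    # to create the node again.
    if zk.exists(CONTROLLER_PATH):
        return zk.read(CONTROLLER_PATH)
    try:
        zk.create_ephemeral(CONTROLLER_PATH, broker_id)
        return broker_id
    except NodeExistsError:
        # Another broker won the race between our check and our create.
        return zk.read(CONTROLLER_PATH)


zk = FakeZk()
first = elect(zk, broker_id=1)   # stands in for the call from startup
second = elect(zk, broker_id=1)  # stands in for the call from handleDataDeleted
other = elect(zk, broker_id=2)   # a different broker arriving later
print(first, second, other)
```

In this sketch the second call returns the existing controller id instead of 
attempting a conflicting write, which is the behaviour the check is meant to 
provide.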

> Broker stuck due to leader election race 
> -
>
> Key: KAFKA-1451
> URL: https://issues.apache.org/jira/browse/KAFKA-1451
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.8.1.1
>Reporter: Maciek Makowski
>Priority: Minor
>


[jira] [Commented] (KAFKA-1451) Broker stuck due to leader election race

2014-07-18 Thread Neha Narkhede (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14066456#comment-14066456
 ] 

Neha Narkhede commented on KAFKA-1451:
--

Just checking the existence is not enough, since there is a risk of not 
electing a controller at all if all brokers do the same and the node 
disappears. The following will work:
1. Register watch
2. Check existence and elect if one does not exist

#1 ensures that if the node disappears, an election will take place.
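
The ordering above can be sketched as a small Python simulation against an 
in-memory stand-in for ZooKeeper. FakeZk, Broker and the register_watch flag 
are hypothetical names introduced here for illustration; the real broker uses 
ZkClient listeners, and this is only a sketch of the argument, not the actual 
implementation.

```python
class NodeExistsError(Exception):
    """Raised when a znode already exists at the requested path."""


class FakeZk:
    """In-memory stand-in for ZooKeeper with delete notifications."""

    def __init__(self):
        self.nodes = {}      # path -> data
        self.listeners = {}  # path -> callbacks run when the path is deleted

    def exists(self, path):
        return path in self.nodes

    def create_ephemeral(self, path, data):
        if path in self.nodes:
            raise NodeExistsError(path)
        self.nodes[path] = data

    def subscribe(self, path, callback):
        self.listeners.setdefault(path, []).append(callback)

    def delete(self, path):
        del self.nodes[path]
        for callback in list(self.listeners.get(path, [])):
            callback()


CONTROLLER_PATH = "/controller"


class Broker:
    def __init__(self, zk, broker_id, register_watch):
        self.zk = zk
        self.broker_id = broker_id
        self.register_watch = register_watch

    def startup(self):
        if self.register_watch:
            # Step 1: register the watch before looking at the node.
            self.zk.subscribe(CONTROLLER_PATH, self.elect)
        # Step 2: check existence and elect if no controller exists.
        self.elect()

    def elect(self):
        if self.zk.exists(CONTROLLER_PATH):
            return
        try:
            self.zk.create_ephemeral(CONTROLLER_PATH, self.broker_id)
        except NodeExistsError:
            pass  # another broker created the node first


def controller_after_old_one_disappears(register_watch):
    zk = FakeZk()
    zk.create_ephemeral(CONTROLLER_PATH, 0)  # node left by an old controller
    for broker_id in (1, 2):
        Broker(zk, broker_id, register_watch).startup()
    zk.delete(CONTROLLER_PATH)  # the old node disappears after the checks
    return zk.nodes.get(CONTROLLER_PATH)


print(controller_after_old_one_disappears(register_watch=False))
print(controller_after_old_one_disappears(register_watch=True))
```

Without the watch, every broker sees the old node, skips the election, and no 
controller exists once the node disappears. With the watch registered first, 
the deletion triggers an election and one broker takes over.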

> Broker stuck due to leader election race 
> -
>
> Key: KAFKA-1451
> URL: https://issues.apache.org/jira/browse/KAFKA-1451
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.8.1.1
>Reporter: Maciek Makowski
>Priority: Minor
>  Labels: newbie
>


[jira] [Commented] (KAFKA-1451) Broker stuck due to leader election race

2014-07-20 Thread Jun Rao (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068193#comment-14068193
 ] 

Jun Rao commented on KAFKA-1451:


Neha, I am not sure if #1 is needed. We can get into elect from two paths: (1) 
from startup or (2) from handleDataDeleted. If it's from startup, we already 
register the watcher before calling elect. If it's from handleDataDeleted, it 
means that the watcher must have already been registered. So, once in elect, we 
know the watcher is already registered. If the controller node goes away 
immediately after we check its existence, the watcher is guaranteed to be 
triggered.

> Broker stuck due to leader election race 
> -
>
> Key: KAFKA-1451
> URL: https://issues.apache.org/jira/browse/KAFKA-1451
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.8.1.1
>Reporter: Maciek Makowski
>Priority: Minor
>  Labels: newbie
>


[jira] [Commented] (KAFKA-1451) Broker stuck due to leader election race

2014-07-26 Thread Manikumar Reddy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075317#comment-14075317
 ] 

Manikumar Reddy commented on KAFKA-1451:


Created reviewboard https://reviews.apache.org/r/23962/diff/
 against branch origin/trunk

> Broker stuck due to leader election race 
> -
>
> Key: KAFKA-1451
> URL: https://issues.apache.org/jira/browse/KAFKA-1451
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.8.1.1
>Reporter: Maciek Makowski
>Priority: Minor
>  Labels: newbie
> Attachments: KAFKA-1451.patch
>
>


[jira] [Commented] (KAFKA-1451) Broker stuck due to leader election race

2014-07-26 Thread Manikumar Reddy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075321#comment-14075321
 ] 

Manikumar Reddy commented on KAFKA-1451:


Uploaded a patch which checks controller existence in the leader election 
process. With this patch I am not able to reproduce the issue.


> Broker stuck due to leader election race 
> -
>
> Key: KAFKA-1451
> URL: https://issues.apache.org/jira/browse/KAFKA-1451
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.8.1.1
>Reporter: Maciek Makowski
>Assignee: Manikumar Reddy
>Priority: Minor
>  Labels: newbie
> Attachments: KAFKA-1451.patch
>
>


[jira] [Commented] (KAFKA-1451) Broker stuck due to leader election race

2014-07-28 Thread Manikumar Reddy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076270#comment-14076270
 ] 

Manikumar Reddy commented on KAFKA-1451:


Updated reviewboard https://reviews.apache.org/r/23962/diff/
 against branch origin/trunk

> Broker stuck due to leader election race 
> -
>
> Key: KAFKA-1451
> URL: https://issues.apache.org/jira/browse/KAFKA-1451
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.8.1.1
>Reporter: Maciek Makowski
>Assignee: Manikumar Reddy
>Priority: Minor
>  Labels: newbie
> Attachments: KAFKA-1451.patch, KAFKA-1451_2014-07-28_20:17:21.patch
>
>


[jira] [Commented] (KAFKA-1451) Broker stuck due to leader election race

2014-07-28 Thread Manikumar Reddy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076274#comment-14076274
 ] 

Manikumar Reddy commented on KAFKA-1451:


Created reviewboard https://reviews.apache.org/r/23983/diff/
 against branch origin/trunk

> Broker stuck due to leader election race
> ----------------------------------------
>
>                 Key: KAFKA-1451
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1451
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.8.1.1
>            Reporter: Maciek Makowski
>            Assignee: Manikumar Reddy
>            Priority: Minor
>              Labels: newbie
>         Attachments: KAFKA-1451.patch, KAFKA-1451.patch
>
>


[jira] [Commented] (KAFKA-1451) Broker stuck due to leader election race

2014-07-28 Thread Manikumar Reddy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076277#comment-14076277
 ] 

Manikumar Reddy commented on KAFKA-1451:


Updated reviewboard https://reviews.apache.org/r/23962/diff/
 against branch origin/trunk

> Broker stuck due to leader election race
> ----------------------------------------
>
>                 Key: KAFKA-1451
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1451
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.8.1.1
>            Reporter: Maciek Makowski
>            Assignee: Manikumar Reddy
>            Priority: Minor
>              Labels: newbie
>         Attachments: KAFKA-1451.patch, KAFKA-1451_2014-07-28_20:27:32.patch
>
>


[jira] [Commented] (KAFKA-1451) Broker stuck due to leader election race

2014-07-28 Thread Manikumar Reddy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077385#comment-14077385
 ] 

Manikumar Reddy commented on KAFKA-1451:


Updated reviewboard https://reviews.apache.org/r/23962/diff/
 against branch origin/trunk

> Broker stuck due to leader election race
> ----------------------------------------
>
>                 Key: KAFKA-1451
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1451
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.8.1.1
>            Reporter: Maciek Makowski
>            Assignee: Manikumar Reddy
>            Priority: Minor
>              Labels: newbie
>         Attachments: KAFKA-1451.patch, KAFKA-1451_2014-07-28_20:27:32.patch, 
> KAFKA-1451_2014-07-29_10:13:23.patch
>
>


[jira] [Commented] (KAFKA-1451) Broker stuck due to leader election race

2014-08-10 Thread Joe Stein (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092051#comment-14092051
 ] 

Joe Stein commented on KAFKA-1451:
--

Hi, two issues so far were found with leader election: 
https://issues.apache.org/jira/browse/KAFKA-1387?focusedCommentId=14087063&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14087063
I don't know yet whether the issues are related to each other, or even to this 
one. The issues found were not happening on the 0.8.1 branch, so they could 
come from another 0.8.2 patch, I suppose. But before I start testing a 0.8.2 
version without this patch (to isolate the root cause), I wanted to see whether 
this type of scenario was tested, and what the general thoughts are on this 
patch and how it might be affecting either of the two issues found in 0.8.2 trunk.

> Broker stuck due to leader election race
> ----------------------------------------
>
>                 Key: KAFKA-1451
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1451
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.8.1.1
>            Reporter: Maciek Makowski
>            Assignee: Manikumar Reddy
>            Priority: Minor
>              Labels: newbie
>             Fix For: 0.8.2
>
>         Attachments: KAFKA-1451.patch, KAFKA-1451_2014-07-28_20:27:32.patch, 
> KAFKA-1451_2014-07-29_10:13:23.patch
>
>


[jira] [Commented] (KAFKA-1451) Broker stuck due to leader election race

2014-08-10 Thread Jun Rao (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092253#comment-14092253
 ] 

Jun Rao commented on KAFKA-1451:


Joe,

KAFKA-1387 seems to be related to broker registration and this jira only fixes 
how the controller is registered in ZK. So, I am not sure if they are related.

> Broker stuck due to leader election race
> ----------------------------------------
>
>                 Key: KAFKA-1451
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1451
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.8.1.1
>            Reporter: Maciek Makowski
>            Assignee: Manikumar Reddy
>            Priority: Minor
>              Labels: newbie
>             Fix For: 0.8.2
>
>         Attachments: KAFKA-1451.patch, KAFKA-1451_2014-07-28_20:27:32.patch, 
> KAFKA-1451_2014-07-29_10:13:23.patch
>
>