[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time

2015-02-16 Thread Tommy Becker (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14323229#comment-14323229
 ] 

Tommy Becker commented on KAFKA-1387:
-

Can a project member comment on what it is going to take to get this patch 
accepted?  We have been running 0.8.1 with it for months, and I guess we'll 
have to apply it to 0.8.2 as well, but it would be nice to get it into the 
official tree...

> Kafka getting stuck creating ephemeral node it has already created when two 
> zookeeper sessions are established in a very short period of time
> -
>
> Key: KAFKA-1387
> URL: https://issues.apache.org/jira/browse/KAFKA-1387
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 0.8.1.1
> Reporter: Fedor Korotkiy
> Labels: newbie, patch
> Attachments: kafka-1387.patch
>
>
> The Kafka broker re-registers itself in zookeeper every time the 
> handleNewSession() callback is invoked.
> https://github.com/apache/kafka/blob/0.8.1/core/src/main/scala/kafka/server/KafkaHealthcheck.scala
>  
> Now imagine the following sequence of events.
> 1) The zookeeper session is reestablished. The handleNewSession() callback is 
> queued by the zkClient, but not invoked yet.
> 2) The zookeeper session is reestablished again, queueing the callback a 
> second time.
> 3) The first callback is invoked, creating the /broker/[id] ephemeral path.
> 4) The second callback is invoked and tries to create the /broker/[id] path 
> using the createEphemeralPathExpectConflictHandleZKBug() function. But the path 
> already exists, so createEphemeralPathExpectConflictHandleZKBug() gets 
> stuck in an infinite loop.
> It seems the controller election code has the same issue.
> I am able to reproduce this issue on the 0.8.1 branch from github using the 
> following configs.
> # zookeeper
> tickTime=10
> dataDir=/tmp/zk/
> clientPort=2101
> maxClientCnxns=0
> # kafka
> broker.id=1
> log.dir=/tmp/kafka
> zookeeper.connect=localhost:2101
> zookeeper.connection.timeout.ms=100
> zookeeper.sessiontimeout.ms=100
> Just start kafka and zookeeper and then pause zookeeper several times using 
> Ctrl-Z.
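
The four-step sequence in the description can be replayed against a toy model. The following is only a sketch under stated assumptions: `FakeZk`, `registerWithRetry`, the path `/brokers/ids/1`, and the `maxRetries` bound are all hypothetical and exist only so the example is self-contained and terminates. It is a simplified model of the retry behavior described above, not the actual Kafka or zkClient code.

```scala
import scala.collection.mutable

object DoubleCallbackSketch {
  // Hypothetical in-memory stand-in for ZooKeeper: path -> (data, owning session id).
  final class FakeZk {
    private val nodes = mutable.Map.empty[String, (String, Long)]
    private var session = 1L

    // Expiring a session deletes its ephemerals; the client then gets a new session.
    def expireAndReconnect(): Unit = {
      val expired = session
      session += 1
      val stale = nodes.toList.collect { case (path, (_, owner)) if owner == expired => path }
      nodes --= stale
    }

    // Returns false (instead of throwing) when the node already exists.
    def createEphemeral(path: String, data: String): Boolean =
      if (nodes.contains(path)) false
      else { nodes(path) = (data, session); true }

    def read(path: String): Option[String] = nodes.get(path).map(_._1)
  }

  // Simplified model of the retry loop: on a conflict with our own data, back off and retry.
  // maxRetries is an assumption of this sketch; the loop described in the issue has no bound.
  def registerWithRetry(zk: FakeZk, path: String, data: String, maxRetries: Int): Boolean = {
    var conflicts = 0
    while (conflicts < maxRetries) {
      if (zk.createEphemeral(path, data)) return true
      zk.read(path) match {
        case Some(existing) if existing == data => conflicts += 1 // "I wrote this node", retry
        case Some(other) => throw new IllegalStateException(s"$path is held by $other")
        case None => () // node vanished in the meantime, retry immediately
      }
    }
    false // gave up; without the bound this would spin forever
  }

  // Replays steps 1-4 and returns the outcome of the first and second callback.
  def replay(): (Boolean, Boolean) = {
    val zk = new FakeZk
    zk.expireAndReconnect() // 1) new session, callback queued
    zk.expireAndReconnect() // 2) another new session, callback queued again
    val first = registerWithRetry(zk, "/brokers/ids/1", "broker-1", maxRetries = 5)  // 3)
    val second = registerWithRetry(zk, "/brokers/ids/1", "broker-1", maxRetries = 5) // 4)
    (first, second)
  }
}
```

In this model the second callback never succeeds, because the node it is waiting on belongs to the current, live session and so is never deleted.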



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time

2015-04-27 Thread Thomas Omans (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14516102#comment-14516102
 ] 

Thomas Omans commented on KAFKA-1387:
-

I am seeing similar behavior in my consumer, using Kafka 0.8.2.1 and ZooKeeper 
3.4.6.

The following repeats in an infinite loop:

{code}
15/04/27 17:44:31 INFO utils.ZkUtils$: conflict in /consumers/**
15/04/27 17:44:31 INFO utils.ZkUtils$: I wrote this conflicted ephemeral node ** a while back in a different session, hence I will backoff for this node to be deleted by Zookeeper and retry
15/04/27 17:45:01 INFO utils.ZkUtils$: conflict in /consumers/**
15/04/27 17:45:01 INFO utils.ZkUtils$: I wrote this conflicted ephemeral node ** a while back in a different session, hence I will backoff for this node to be deleted by Zookeeper and retry
{code}



[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time

2015-04-29 Thread Marcus Aidley (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14519189#comment-14519189
 ] 

Marcus Aidley commented on KAFKA-1387:
--

I've also encountered this issue running Kafka 0.8.2.0 and Zookeeper 3.4.6 in a 
three-node cluster. The error occurred after two zookeeper nodes were restarted 
at the same time. The error below appeared repeatedly in the Kafka logs. I 
resolved the issue by restarting Kafka.

{panel}
[2015-04-27 03:47:03,292] INFO I wrote this conflicted ephemeral node ["jmx_port":-1,"timestamp":"1430038275477","host":"ams5mdppdmsbacmq01b.markit.partners","version":1,"port":9092] at /brokers/ids/2 a while back in a different session, hence I will backoff for this node to be deleted by Zookeeper and retry (kafka.utils.ZkUtils$)
{panel}




[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time

2015-04-29 Thread Thomas Omans (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14520867#comment-14520867
 ] 

Thomas Omans commented on KAFKA-1387:
-

It looks like this "infinite retry" behavior exists in Kafka only to accommodate 
another strange issue, where zookeeper was deleting ephemeral nodes out from 
under it:

https://github.com/apache/kafka/blob/0.8.2.1/core/src/main/scala/kafka/utils/ZkUtils.scala#L272
https://issues.apache.org/jira/browse/ZOOKEEPER-1740

It seems the simplest thing to do would be to just delete the conflicted node 
and write the truth about the process environment as the process knows it.

I see that my issue appeared in the consumer code, whereas this issue is 
occurring in the Kafka brokers themselves, but the bug appears to be the same.

There are two exceptional cases for ephemeral nodes that I can see. In the 
first, the ZOOKEEPER-1740 bug was hit: our ephemeral node was mysteriously 
lost out from under us, but our session is still active and we can just create 
a new one. In the second, which I believe is what we are seeing, the session is 
long gone but the ephemeral node is still hanging around until the consumer 
process exits.

Currently the first case is handled, but the second case is not.
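
The "delete the conflicted node and write it again" idea can be sketched as follows. This is only an illustration under stated assumptions: `FakeZk`, `registerByOverwriting`, and the path `/consumers/example-group/ids/consumer-1` are hypothetical names, and the in-memory map merely stands in for ZooKeeper. It does not answer whether deleting is always safe.

```scala
import scala.collection.mutable

object DeleteAndRecreateSketch {
  // Hypothetical in-memory znode table: path -> (data, owning session id).
  final class FakeZk(val sessionId: Long) {
    val nodes = mutable.Map.empty[String, (String, Long)]

    // Returns false (instead of throwing) when the node already exists.
    def createEphemeral(path: String, data: String): Boolean =
      if (nodes.contains(path)) false
      else { nodes(path) = (data, sessionId); true }

    def delete(path: String): Unit = { nodes.remove(path); () }
  }

  // On conflict, drop whatever is there and write the current state under the current session.
  def registerByOverwriting(zk: FakeZk, path: String, data: String): Unit =
    if (!zk.createEphemeral(path, data)) {
      zk.delete(path)
      zk.createEphemeral(path, data)
      ()
    }

  // A node left over from session 1 is replaced by one owned by session 2.
  def demo(): (String, Long) = {
    val zk = new FakeZk(sessionId = 2L)
    val path = "/consumers/example-group/ids/consumer-1" // hypothetical path
    zk.nodes(path) = ("stale-data", 1L)
    registerByOverwriting(zk, path, "fresh-data")
    zk.nodes(path)
  }
}
```

The sketch covers the second case above (a leftover node from a dead session). Unlike the back-off loop, it terminates after at most one delete and one create.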



[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time

2015-05-07 Thread Abhishek Nigam (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14533402#comment-14533402
 ] 

Abhishek Nigam commented on KAFKA-1387:
---

I have seen the ephemeral node issue before and the fix made there was exactly 
what Thomas mentioned:
"It seems the simplest thing to do would be to just delete the conflicted node 
and write the truth about the process environment it knows."

Is there a reason why the approach outlined by Thomas does not work for kafka?

> Kafka getting stuck creating ephemeral node it has already created when two 
> zookeeper sessions are established in a very short period of time
> -
>
> Key: KAFKA-1387
> URL: https://issues.apache.org/jira/browse/KAFKA-1387
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 0.8.1.1
> Reporter: Fedor Korotkiy
> Priority: Blocker
> Labels: newbie, patch, zkclient-problems
> Attachments: kafka-1387.patch


[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time

2015-07-30 Thread Clark Haskins (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648378#comment-14648378
 ] 

Clark Haskins commented on KAFKA-1387:
--

This patch is listed as a blocker. Can the existing patch be committed? Is 
anyone actively working on it? 

This has been a problem for us recently and we would like to see this fixed 
soon.

Thanks,
-Clark



[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time

2015-08-11 Thread Mayuresh Gharat (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14682263#comment-14682263
 ] 

Mayuresh Gharat commented on KAFKA-1387:


Can the person who uploaded the patch submit a test case showing how to 
reproduce this? 
We are hitting this in production but are not able to reproduce it locally.





[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time

2015-08-11 Thread Fedor Korotkiy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14682269#comment-14682269
 ] 

Fedor Korotkiy commented on KAFKA-1387:
---

Have you tried the steps from the issue description?



[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time

2015-08-11 Thread Guozhang Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14682416#comment-14682416
 ] 

Guozhang Wang commented on KAFKA-1387:
--

[~fpj] Could you help take a look at this issue?



[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time

2015-08-11 Thread James Lent (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692595#comment-14692595
 ] 

James Lent commented on KAFKA-1387:
---

It has been a while since I investigated this issue. I will take another look 
at it tomorrow and get back to you. 



[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time

2015-08-12 Thread James Lent (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14694052#comment-14694052
 ] 

James Lent commented on KAFKA-1387:
---

After refreshing my memory of this issue I was unable to come up with any new 
ideas for how to create an automated test case for the issue.  I was only able 
to reproduce this issue in my dev environment using the cumbersome manual 
process I outlined in my Sept 27 comment.

My question posted to the zookeeper-user mailing list regarding the validity of 
the key assumption of the patch logic generated no feedback.

We have been using the patch I provided with Kafka 0.8.1.1 for almost a year 
now.  We have not seen a re-occurrence of the hung ephemeral connection issue 
since then.  Since the problem was intermittent and only triggered when the 
system was unstable, this may or may not be due to the presence of the patch.

One NPE issue was found during testing in March, when our application code 
changed and in certain cases tried to close a Connector that had never been 
fully started.  That was fixed as follows:

{noformat}
Index: core/src/main/scala/kafka/consumer/ZookeeperConsumerConnector.scala
===
--- core/src/main/scala/kafka/consumer/ZookeeperConsumerConnector.scala (revision 73668)
+++ core/src/main/scala/kafka/consumer/ZookeeperConsumerConnector.scala (revision 73669)
@@ -162,7 +162,9 @@
       if (canShutdown) {
         info("ZKConsumerConnector shutting down")
 
-        consumerNodeMonitor.close()
+        if (consumerNodeMonitor != null) {
+          consumerNodeMonitor.close()
+        }
 
         if (wildcardTopicWatcher != null)
           wildcardTopicWatcher.shutdown()
{noformat}

Not sure any of this was of much help, but, I would be happy to try to answer 
any questions regarding the patch logic and/or update it based on your comments.



[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time

2015-08-14 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697002#comment-14697002
 ] 

Flavio Junqueira commented on KAFKA-1387:
-

I'm actually really sorry that this issue has been around for so long, I didn't 
realize it was going on and that I was even indirectly participating in it. Let 
me start by giving a sort of general overview of what to expect.

If a client has received a session expiration event, it means that the leader 
has expired the session and has broadcast the closeSession event to the 
followers. If the same client creates a new session successfully, then the 
server it connects to must have applied the previous closeSession, which 
deletes the ephemeral znodes, because ZK guarantees that txns are totally 
ordered. Consequently, the client shouldn't observe an ephemeral from an old 
session of its own. Note that another client could still observe the ephemeral 
znode after the session expiration if it is connected to a server that is a bit 
behind, but that's fine.

What I'm thinking is that one problem that could happen is that a client 
creates a new session before receiving the session expiration for an earlier 
session. In that case the ephemerals will still be there because the session 
still exists.

The bottom line is that if the client has seen the session expiration event, 
then it seems fine to go ahead and create new ephemerals without having to 
check whether ephemerals are stale or not. If the session creation isn't clean, 
then there are a few options like waiting for the timeout period, storing and 
recovering the session id.
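
One way to read the explanation above is as a decision based on which session owns an existing ephemeral. The sketch below is an assumption-laden illustration, not the attached patch: `Action`, `decide`, and the session ids are hypothetical. In the real ZooKeeper Java client the owner of a znode is available through `Stat.getEphemeralOwner()` and the current session through `ZooKeeper.getSessionId()`; the sketch takes both values as plain arguments instead of calling the client.

```scala
object OwnerCheckSketch {
  sealed trait Action
  case object Create extends Action            // no node: safe to create the ephemeral
  case object AlreadyRegistered extends Action // node belongs to the current session
  case object WaitForExpiry extends Action     // node belongs to a session not yet expired

  // existingOwner: session id owning the ephemeral if the node exists, None otherwise.
  def decide(existingOwner: Option[Long], currentSession: Long): Action =
    existingOwner match {
      case None                                   => Create
      case Some(owner) if owner == currentSession => AlreadyRegistered
      case Some(_)                                => WaitForExpiry
    }
}
```

Under this reading, the double-callback case from the issue description falls into `AlreadyRegistered` (the second callback finds a node its own live session created), while a client that reconnected before its earlier session expired falls into `WaitForExpiry`.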

I'll dig into the code to see how we can fix this, have a closer look at the 
patch, and will reopen the associated ZOOKEEPER-1740 issue until we sort this 
out. Let me know if the explanation above makes sense in the meantime. 



[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time

2015-08-14 Thread Abhishek Nigam (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697427#comment-14697427
 ] 

Abhishek Nigam commented on KAFKA-1387:
---

Thanks a lot for digging into this. Not sure if it helps, but in the past
when I saw this issue it went like this:
a) Say the session timeout is 30 seconds.
b) If we kill the instance which created the zookeeper ephemeral node and
bring it back up quickly (in less than 30 seconds), we find that the previous
session's data (the ephemeral node) still exists.

The solution was to assume the existing data was from an old session, and
delete and re-create it during startup. However, we were processing the
zookeeper events on a single thread.






[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time

2015-08-14 Thread Guozhang Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697907#comment-14697907
 ] 

Guozhang Wang commented on KAFKA-1387:
--

Thanks [~fpj], this is very helpful.

Just to add some more context regarding this issue, we have seen cases where 
ephemeral nodes were not deleted when brokers / consumers tried to re-register 
themselves in ZK upon a session timeout event (details can be found in 
KAFKA-992). We tried to fix it by adding a registration timestamp to the 
registration node's data and, upon seeing an existing node, checking whether 
its timestamp is different; if it is not, we back off and wait for the node to 
be removed.

However, people have also reported a couple of times that the backing off 
never ends, i.e. the node has a different timestamp, but is never 
deleted. The suspicion was that there were multiple consecutive session 
creations within a very short period of time, and that the node with a 
different timestamp was created by a session that had not actually expired, and 
hence will never be gone. No one has validated whether this is the case, though.

The logic of re-registration can be found in ZookeeperConsumerConnector.scala 
and KafkaHealthcheck.scala.
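
To make the back-off behaviour concrete, here is a minimal Java sketch of the 
check described above, run against a toy in-memory stand-in for ZooKeeper. 
Everything in it (ToyZk, BackoffRegistration, createWithBackoff, maxAttempts) 
is hypothetical and is not a Kafka or ZkClient API. It assumes the conflict 
check simply compares the stored data with the data being written, and it 
bounds the number of attempts so the demo terminates, whereas the loop 
reported in this issue never ends.

```java
import java.util.HashMap;
import java.util.Map;

// Toy in-memory stand-in for ZooKeeper: path -> data, plus the session that owns each node.
class ToyZk {
    final Map<String, String> data = new HashMap<>();
    final Map<String, Long> owner = new HashMap<>();

    // Returns false instead of throwing when the node already exists.
    boolean createEphemeral(long session, String path, String value) {
        if (data.containsKey(path)) return false;
        data.put(path, value);
        owner.put(path, session);
        return true;
    }

    // Expiring a session removes every ephemeral node that session owns.
    void expire(long session) {
        owner.entrySet().removeIf(e -> {
            if (e.getValue() == session) { data.remove(e.getKey()); return true; }
            return false;
        });
    }
}

class BackoffRegistration {
    // Returns the attempt number that succeeded, or -1 after maxAttempts failures.
    // 'backoff' stands in for sleeping; the demo uses it to let time pass in the toy store.
    static int createWithBackoff(ToyZk zk, long session, String path, String value,
                                 int maxAttempts, Runnable backoff) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (zk.createEphemeral(session, path, value)) return attempt;
            String existing = zk.data.get(path);
            if (existing != null && !existing.equals(value)) {
                throw new IllegalStateException("path registered by someone else: " + path);
            }
            backoff.run(); // same data: assume it is our own stale node and wait for it to vanish
        }
        return -1; // gave up; an unbounded loop would still be spinning here
    }

    public static void main(String[] args) {
        String path = "/broker/1";
        String value = "host:9092@ts=42";

        // Case 1: the node belongs to an old session that expires during the back-off.
        ToyZk zk1 = new ToyZk();
        zk1.createEphemeral(1L, path, value);
        System.out.println(createWithBackoff(zk1, 2L, path, value, 5, () -> zk1.expire(1L)));

        // Case 2: the node belongs to the current session, so waiting never helps.
        ToyZk zk2 = new ToyZk();
        zk2.createEphemeral(2L, path, value);
        System.out.println(createWithBackoff(zk2, 2L, path, value, 5, () -> {}));
    }
}
```

In the first case the second attempt succeeds; in the second case every 
attempt fails, which is the shape of the hang people have been reporting.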



[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time

2015-08-17 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14699514#comment-14699514
 ] 

Flavio Junqueira commented on KAFKA-1387:
-

There are two problems at a high level described here: zk losing ephemerals and 
ephemerals not going away. I haven't been able to reproduce the former, but 
I've been able to find one potential problem that could be causing it.

I first found it suspicious that the ZK listeners were not dealing with 
session events at all:

{code}
def handleStateChanged(state: KeeperState) {
  // do nothing, since zkclient will do reconnect for us.
}
{code}

It is quite typical with ZK that you wait for the connected event before 
making progress. Looking at the ZkClient implementation, I realized that it 
retries operations in the case of connection loss or session expiration until 
they go through. There is a race here, though. Say you submit a create, but 
instead of getting OK as a response, you get connection loss, even though the 
create may have been applied on the server. ZkClient in this case will say 
"well, need to retry" and will get a node exists exception, which the code 
currently treats as a znode from a previous session. This znode will never go 
away because it belongs to the current session!
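
The race can be reproduced in a few lines with a toy server. This is only a 
sketch under stated assumptions: ConnectionLossRace, Server, dropNextReply and 
createWithRetry are made-up names, not ZooKeeper or ZkClient APIs, and the 
"server" is a map that applies the write before losing the reply.

```java
import java.util.HashMap;
import java.util.Map;

class ConnectionLossRace {
    static class ConnectionLoss extends Exception {}
    static class NodeExists extends Exception {}

    // Toy server: remembers which session owns each ephemeral path.
    static class Server {
        final Map<String, Long> ownerByPath = new HashMap<>();
        boolean dropNextReply = false;

        void create(long session, String path) throws ConnectionLoss, NodeExists {
            if (ownerByPath.containsKey(path)) throw new NodeExists();
            ownerByPath.put(path, session);   // the write is applied...
            if (dropNextReply) {              // ...but the reply never reaches the client
                dropNextReply = false;
                throw new ConnectionLoss();
            }
        }
    }

    // Mimics a transparent retry: connection loss is hidden, only "node exists" surfaces.
    static String createWithRetry(Server server, long session, String path) {
        while (true) {
            try {
                server.create(session, path);
                return "created";
            } catch (ConnectionLoss e) {
                // retry silently; the caller never learns the first attempt went through
            } catch (NodeExists e) {
                long owner = server.ownerByPath.get(path);
                return owner == session
                        ? "exists, owned by CURRENT session"
                        : "exists, owned by another session";
            }
        }
    }

    public static void main(String[] args) {
        Server server = new Server();
        server.dropNextReply = true;
        // prints "exists, owned by CURRENT session"
        System.out.println(createWithRetry(server, 7L, "/broker/1"));
    }
}
```

A caller that only sees the node exists result has no way to tell this case 
from a leftover znode, so waiting for the node to disappear cannot succeed.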

Now let's say we get rid of such corner cases. It is still possible that when 
the client recovers it finds a znode from a previous session. It can happen 
because the lease (session) corresponding to the znode is still valid, so ZK 
can't get rid of it. Revoking leases in general is a bit complicated, but it 
sounds ok in this case if there is no risk of having multiple incarnations of 
the same element (a broker) running concurrently.



[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time

2015-08-17 Thread Guozhang Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700113#comment-14700113
 ] 

Guozhang Wang commented on KAFKA-1387:
--

I thought that when the previous session has ended (e.g. expired), its 
ephemeral node will be "eventually" removed? Does ZooKeeper itself have a 
leasing mechanism?



[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time

2015-08-17 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700187#comment-14700187
 ] 

Flavio Junqueira commented on KAFKA-1387:
-

bq. I thought that when the previous session has ended (e.g. expired), its 
ephemeral node will be "eventually" removed?

If the session ends cleanly, by the client submitting a closeSession request, 
then the session closes and the ephemerals are deleted with the request. But, 
if the client crashes and the server simply stops hearing from the client, then 
the session has to time out and expire so it takes some time.

bq. Does ZooKeeper itself have a leasing mechanism?

I'm referring to the fact that the ephemeral represents a lease that is revoked 
when the session times out.
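
As a rough illustration of the lease view, the following toy model tracks a 
single session with a logical clock. SessionLeaseModel and its methods are 
hypothetical names, and the model deliberately ignores how a real server 
schedules expiration, so removal there can lag the nominal timeout.

```java
// Toy model of one session acting as a lease for its ephemeral znode.
class SessionLeaseModel {
    final long timeoutMs;
    long lastHeardMs;
    boolean closed = false;

    SessionLeaseModel(long timeoutMs, long nowMs) {
        this.timeoutMs = timeoutMs;
        this.lastHeardMs = nowMs;
    }

    // The server heard from the client, so the lease is renewed.
    void heartbeat(long nowMs) { lastHeardMs = nowMs; }

    // Clean shutdown: the ephemeral goes away together with the close request.
    void close() { closed = true; }

    // The ephemeral is visible while the lease is held: not closed and not yet timed out.
    boolean ephemeralVisible(long nowMs) {
        return !closed && nowMs - lastHeardMs < timeoutMs;
    }

    public static void main(String[] args) {
        // Client crashes at t=0 with a 30 second session timeout.
        SessionLeaseModel crashed = new SessionLeaseModel(30_000, 0);
        System.out.println(crashed.ephemeralVisible(10_000)); // true: restart after 10 s still sees the old znode
        System.out.println(crashed.ephemeralVisible(30_000)); // false: the lease has run out

        // Client closes its session cleanly at t=0.
        SessionLeaseModel clean = new SessionLeaseModel(30_000, 0);
        clean.close();
        System.out.println(clean.ephemeralVisible(1));        // false: gone right away
    }
}
```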

I'm not sure if this is clear, but one of the problems I'm pointing out is that 
zkclient might end up creating the ephemeral znode in your *current* session. 
In this case, the znode won't go away. Here is actually another problem I found 
along the same lines. The createEphemeral call in ZkClient ends up calling 
retryUntilConnected, which retries even when the session expires:

{code}
try {
    return callable.call();
} catch (ConnectionLossException e) {
    // we give the event thread some time to update the status to 'Disconnected'
    Thread.yield();
    waitForRetry();
} catch (SessionExpiredException e) {
    // we give the event thread some time to update the status to 'Expired'
    Thread.yield();
    waitForRetry();
}
{code}

In this case, say that one call to createEphemeral via handleNewSession happens 
during a given session, but the session expires before the operation goes 
through. The client will retry with the new session. When the consumer tries 
again, it will fail because the znode is there and won't go away. This is 
another case in which the znode won't go away because it has been created in 
the current session.
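
The sequence above can be walked through with a toy client. This is a sketch 
only: ExpiredRetryScenario, expireOnNextCall and createEphemeralWithRetry are 
invented for illustration and model just one znode and one client, assuming a 
retry loop that swallows session expiration the way the snippet above does.

```java
// Toy state machine for one client and one ephemeral znode.
class ExpiredRetryScenario {
    long currentSession = 1;          // the session the client currently holds
    Long znodeOwner = null;           // null means the znode does not exist
    boolean expireOnNextCall = false; // lets the demo kill the session mid-operation

    // Mimics a retry loop that swallows session expiration and simply tries again.
    // Returns true if this call created the znode, false on "node exists".
    boolean createEphemeralWithRetry() {
        while (true) {
            if (expireOnNextCall) {   // the session dies before the create goes through
                expireOnNextCall = false;
                if (znodeOwner != null && znodeOwner == currentSession) znodeOwner = null;
                currentSession++;     // the client comes back under a new session
                continue;             // ...and the create is retried transparently
            }
            if (znodeOwner != null) return false;
            znodeOwner = currentSession;
            return true;
        }
    }

    public static void main(String[] args) {
        ExpiredRetryScenario c = new ExpiredRetryScenario();
        c.expireOnNextCall = true;
        boolean first = c.createEphemeralWithRetry();  // callback for session 1, completed under session 2
        boolean second = c.createEphemeralWithRetry(); // callback for session 2
        // prints "true false ownerIsCurrent=true"
        System.out.println(first + " " + second + " ownerIsCurrent=" + (c.znodeOwner == c.currentSession));
    }
}
```

The second call is the one that gets stuck in practice: it sees a node it did 
not expect, yet the node is owned by the session it is running in.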



[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time

2015-08-17 Thread Guozhang Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700483#comment-14700483
 ] 

Guozhang Wang commented on KAFKA-1387:
--

[~fpj] That makes sense. So it seems the right resolution should be at the 
ZkClient layer, not on Kafka's layer?



[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time

2015-08-18 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14701418#comment-14701418
 ] 

Flavio Junqueira commented on KAFKA-1387:
-

It doesn't look like it'd be a small change to zkclient to fix this. We 
essentially need it to expose zk events as they occur. The way it currently 
works, the events are serialized and the operations are retried transparently, 
so I can't tell whether the znode already exists because of a connection loss 
or because the session actually expired and there is a new one now. 

The simplest way around this seems to be to just re-register the consumer 
directly (delete and create) upon a node exists exception. This should work 
because of the following argument.

There are three possibilities when we get a node exists exception:

# The znode exists from a previous session and hasn't been reclaimed yet
# The znode exists because of a connection loss event while the znode was being 
created, so the second attempt gets a node exists exception
# The previous session has expired, a new one was created, and the registration 
was occurring around this transition, so when we execute handleNewSession for 
the new session, we get a node exists exception. 

In all three cases, deleting and recreating seems fine. It is clearly 
conservative and more expensive than necessary, but at least it doesn't require 
changes to zkclient. Does it sound reasonable? Do you see any problem? 

CC [~guozhang] [~jwl...@gmail.com]
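
A minimal sketch of the delete-and-recreate idea, against a toy in-memory 
store rather than a real ZooKeeper client. DeleteAndRecreate, Store and 
register are hypothetical names, and the sketch assumes a single registrant 
per path; it does not try to be atomic.

```java
import java.util.HashMap;
import java.util.Map;

class DeleteAndRecreate {
    // Toy store: path -> id of the session that owns the ephemeral znode.
    static final class Store {
        final Map<String, Long> ownerByPath = new HashMap<>();

        // Returns false when the path already exists (the "node exists" case).
        boolean create(long session, String path) {
            return ownerByPath.putIfAbsent(path, session) == null;
        }

        void delete(String path) {
            ownerByPath.remove(path);
        }
    }

    // On "node exists", do not guess where the znode came from: delete it and create it again.
    static void register(Store store, long session, String path) {
        while (!store.create(session, path)) {
            store.delete(path);
        }
    }

    public static void main(String[] args) {
        String path = "/broker/1";
        Store store = new Store();

        store.create(1L, path);      // leftover znode from an older session
        register(store, 2L, path);
        System.out.println(store.ownerByPath.get(path)); // 2

        register(store, 2L, path);   // znode already created in the current session
        System.out.println(store.ownerByPath.get(path)); // 2
    }
}
```

Whatever the origin of the existing znode, the path ends up owned by the 
session that is registering, at the cost of an extra delete.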



[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time

2015-08-18 Thread Guozhang Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14701509#comment-14701509
 ] 

Guozhang Wang commented on KAFKA-1387:
--

Thanks [~fpj], that makes sense to me. [~jwlent55] do you want to submit a new 
patch following this approach?



[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time

2015-08-18 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14701571#comment-14701571
 ] 

Flavio Junqueira commented on KAFKA-1387:
-

[~guozhang] it looks like [~jwl...@gmail.com] isn't in the list of 
contributors, could you add him so that we can assign the jira to him?



[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time

2015-08-19 Thread James Lent (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14702992#comment-14702992
 ] 

James Lent commented on KAFKA-1387:
---

Your approach sounds much simpler than mine (which I like).  Similar to what I 
proposed doing only at startup (ensureNodeDoesNotExist method).  I am however 
not sure I understand the exact change you propose.  As I remember the 
createEphemeralPathExpectConflictHandleZKBug is called by three code paths:

- Register Broker
- Register Consumer
- Leadership election  

In my change I specifically tried to avoid changing the Leadership election logic.

Is your change basically to implement your new logic (delete if already exists) 
instead of calling createEphemeralPathExpectConflictHandleZKBug in the first 
two cases?  If so I agree it sounds reasonable.  I suppose in a 
misconfiguration case two nodes might get into a registration war over the 
Broker node, but, that could (perhaps) be handled at startup (second one fails 
to start up).

If you propose replacing createEphemeralPathExpectConflictHandleZKBug for 
the Leadership election case too, then I am less comfortable making (and 
testing) that change.  I have never really dug into that logic too much.

One other factor to consider is that I am a bit backed up at work right now and 
this issue will not be my highest priority.




[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time

2015-08-19 Thread Guozhang Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14703532#comment-14703532
 ] 

Guozhang Wang commented on KAFKA-1387:
--

[~jwlent55] I agree that this fix may be just for broker / consumer 
registration, i.e. ZK should not be used to detect a misconfiguration in which 
two brokers / clients use the same id. Hence, in that case, with the new 
approach they may end up in a delete-and-write war. We should consider fixing 
such misconfiguration in a different manner, which is orthogonal to this JIRA. 
For leader election, one should not simply delete the path upon conflict; we 
should leave it as is. In the future, we should either fix the root cause in 
ZkClient or move on to a different client, as KIP-30 is currently discussing.

If you do not have time this week and feel it is OK, [~fpj] could you help by 
taking it over?



[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time

2015-08-19 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14703767#comment-14703767
 ] 

Flavio Junqueira commented on KAFKA-1387:
-

I'm indeed proposing to get rid of
createEphemeralPathExpectConflictHandleZKBug. I can investigate the impact on
leadership election.

> Kafka getting stuck creating ephemeral node it has already created when two 
> zookeeper sessions are established in a very short period of time
> -
>
> Key: KAFKA-1387
> URL: https://issues.apache.org/jira/browse/KAFKA-1387
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.8.1.1
>Reporter: Fedor Korotkiy
>Priority: Blocker
>  Labels: newbie, patch, zkclient-problems
> Attachments: kafka-1387.patch


[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time

2015-08-25 Thread Guozhang Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14711838#comment-14711838
 ] 

Guozhang Wang commented on KAFKA-1387:
--

Thanks for the patch, [~fpj]. Here are some high-level comments:

1. Will mixing direct ZK usage with ZkClient violate ordering? AFAIK ZkClient
orders all events fired by watchers and hands them to the user callbacks one by
one; if we use ZK's Watcher directly, will its callback be called out of order
with respect to other events?

2. If we get a Code.OK in CreateCallback, do we still need to trigger a
ZooKeeper.exists with ExistsCallback again?

3. For the consumer / server registration case in particular, we try to handle
parent path creation in ZkUtils.makeSurePersistentPathExists, so I feel we
should expose the problem that the parent path does not exist yet instead of
trying to hide it in createRecursive.

> Kafka getting stuck creating ephemeral node it has already created when two 
> zookeeper sessions are established in a very short period of time
> -
>
> Key: KAFKA-1387
> URL: https://issues.apache.org/jira/browse/KAFKA-1387
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.8.1.1
>Reporter: Fedor Korotkiy
>Assignee: Flavio Junqueira
>Priority: Blocker
>  Labels: newbie, patch, zkclient-problems
> Attachments: KAFKA-1387.patch, kafka-1387.patch


[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time

2015-08-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14721137#comment-14721137
 ] 

ASF GitHub Bot commented on KAFKA-1387:
---

GitHub user fpj opened a pull request:

https://github.com/apache/kafka/pull/178

KAFKA-1387: Kafka getting stuck creating ephemeral node it has already 
created when two zookeeper sessions are established in a very short period of 
time

This is a patch to get around the problem discussed in the KAFKA-1387 jira.
The tests are not passing on my box when I run them all, but they do pass when
I run them individually, which indicates that something is leaking from one
test to the next. I still need to work this out and also do further testing.
I wanted to open this PR now so that it can start getting reviewed.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/fpj/kafka 1387

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/kafka/pull/178.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #178


commit f8be8657e649d0490e9ed1f1ef52234b3c31435e
Author: flavio junqueira 
Date:   2015-08-23T13:55:11Z

KAFKA-1387: First cut, node dependency on curator

commit b8f901b6478d4ac9c961e899d702e6fc11cfee07
Author: flavio junqueira 
Date:   2015-08-23T13:55:11Z

KAFKA-1387: First cut, node dependency on curator

commit 2369e66921f88b2ee1b24ddeff2bf2d050015447
Author: flavio junqueira 
Date:   2015-08-23T14:07:41Z

Merge branch '1387' of https://github.com/fpj/kafka into 1387

commit f03c301d5d919d9c05c6837de508b4f383906fdb
Author: flavio junqueira 
Date:   2015-08-23T13:55:11Z

KAFKA-1387: First cut, node dependency on curator

commit d8eab9e0f569eaaecb4afda4d486d00600ad1e6f
Author: flavio junqueira 
Date:   2015-08-24T14:56:01Z

KAFKA-1387: Some polishing

commit b7cbe5dbecbc28a564b99209114f39db785c73dd
Author: flavio junqueira 
Date:   2015-08-24T15:50:58Z

KAFKA-1387: Style fixes

commit 336f67c641c44b73ac1dbb66cdde4ff97f2fcd9a
Author: flavio junqueira 
Date:   2015-08-24T15:53:18Z

KAFKA-1387: More style fixes

commit 201ab2dcc33ba10a19c51f7452ce40497d3fcf83
Author: flavio junqueira 
Date:   2015-08-24T15:59:32Z

Merge branch '1387' of https://github.com/fpj/kafka into 1387

commit 9961665230e04331f7767d8aa8aaac0a14f46cd8
Author: flavio junqueira 
Date:   2015-08-23T13:55:11Z

KAFKA-1387: First cut, node dependency on curator

commit b52c12422f7a831137d8659f14779eaad1972217
Author: flavio junqueira 
Date:   2015-08-24T14:56:01Z

KAFKA-1387: Some polishing

commit b2400a0a37555250d50b1f1abfdda2c4d00b03ac
Author: flavio junqueira 
Date:   2015-08-24T15:50:58Z

KAFKA-1387: Style fixes

commit 888f6e0cf17d6a3a8d6b8dd46f8099731ba36511
Author: flavio junqueira 
Date:   2015-08-24T15:53:18Z

KAFKA-1387: More style fixes

commit d675b024b0e8627c4c2c9c113c07527851e81f7a
Author: flavio junqueira 
Date:   2015-08-29T15:00:07Z

KAFKA-1387

commit 4c83ac2609ed29a0f1887bf5087dab50e3e93488
Author: flavio junqueira 
Date:   2015-08-29T15:07:23Z

KAFKA-1387: Removing whitespaces.

commit 240b51a77715c53db784d5932702318ff28468c2
Author: flavio junqueira 
Date:   2015-08-29T15:11:30Z

Merge branch '1387' of https://github.com/fpj/kafka into 1387




> Kafka getting stuck creating ephemeral node it has already created when two 
> zookeeper sessions are established in a very short period of time
> -
>
> Key: KAFKA-1387
> URL: https://issues.apache.org/jira/browse/KAFKA-1387
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.8.1.1
>Reporter: Fedor Korotkiy
>Assignee: Flavio Junqueira
>Priority: Blocker
>  Labels: newbie, patch, zkclient-problems
> Attachments: KAFKA-1387.patch, kafka-1387.patch

[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time

2015-09-22 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903437#comment-14903437
 ] 

Flavio Junqueira commented on KAFKA-1387:
-

hey [~guozhang]

bq. Will mixing direct ZK usage with ZkClient violate ordering? AFAIK ZkClient
orders all events fired by watchers and hands them to the user callbacks one by
one; if we use ZK's Watcher directly, will its callback be called out of order
with respect to other events?

ZkClient indeed hands the processing to a separate thread. To avoid blocking
the dispatcher thread, it uses a separate thread to deliver events. This can be
a problem if the events here and events handled directly by ZkClient are
correlated. I tried to confine the ZK processing for this feature to a single
class to avoid ordering issues. I don't see a concrete problem, but if you
do, let me know. Right now it sounds like you're just speculating that it could
be a problem, yes?

bq. If we get a Code.OK in CreateCallback, do we still need to trigger a
ZooKeeper.exists with ExistsCallback again?

Right, that exists call is to set a watch.

bq. For the consumer / server registration case in particular, we try to handle
parent path creation in ZkUtils.makeSurePersistentPathExists, so I feel we
should expose the problem that the parent path does not exist yet instead of
trying to hide it in createRecursive.

I've commented on the PR about this. What's your specific concern here? If you
could elaborate a bit more, I'd appreciate it.

> Kafka getting stuck creating ephemeral node it has already created when two 
> zookeeper sessions are established in a very short period of time
> -
>
> Key: KAFKA-1387
> URL: https://issues.apache.org/jira/browse/KAFKA-1387
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.8.1.1
>Reporter: Fedor Korotkiy
>Assignee: Flavio Junqueira
>Priority: Critical
>  Labels: newbie, patch, zkclient-problems
> Fix For: 0.9.0.0
>
> Attachments: KAFKA-1387.patch, kafka-1387.patch


[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time

2015-09-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906672#comment-14906672
 ] 

ASF GitHub Bot commented on KAFKA-1387:
---

Github user asfgit closed the pull request at:

https://github.com/apache/kafka/pull/178


> Kafka getting stuck creating ephemeral node it has already created when two 
> zookeeper sessions are established in a very short period of time
> -
>
> Key: KAFKA-1387
> URL: https://issues.apache.org/jira/browse/KAFKA-1387
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.8.1.1
>Reporter: Fedor Korotkiy
>Assignee: Flavio Junqueira
>Priority: Critical
>  Labels: newbie, patch, zkclient-problems
> Fix For: 0.9.0.0
>
> Attachments: KAFKA-1387.patch, kafka-1387.patch


[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time

2014-09-27 Thread James Lent (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14150883#comment-14150883
 ] 

James Lent commented on KAFKA-1387:
---

I have seen this issue in our QA environment (3 ZooKeeper, 3 Kafka and several
application-specific nodes) several times now.  The problem is triggered when
the system is under stress (high I/O and CPU load) and the ZooKeeper
connections become unstable.  When this happens, Kafka threads can get stuck
trying to register Broker nodes and application threads get stuck trying to
register Consumer nodes. One way to recover is to restart the impacted nodes.
As an experiment I also tried deleting the blocking ZooKeeper nodes (hours
later, when the system was under no stress).  When I did so,
createEphemeralPathExpectConflictHandleZKBug would process one expire, break
out of its loop, but then immediately reenter it when it tried to process the
next expire message.  The few times I tested this approach I had to delete the
node dozens of times before the problem would clear itself; in other words,
there were dozens of Expire messages waiting to be processed. Obviously I am
looking into this issue from a configuration point of view (avoid the unstable
connection issue), but this Kafka error behavior concerns me.

I have reproduced it (somewhat artificially) in a dev environment as follows:

1) Start one ZooKeeper and one Kafka node.
2) Set a thread breakpoint in KafkaHealthcheck.scala at the register() call:
def handleNewSession() {
  info("re-registering broker info in ZK for broker " + brokerId)
  register()  // <-- breakpoint here
  info("done re-registering broker")
  info("Subscribing to %s path to watch for new topics".format(ZkUtils.BrokerTopicsPath))
}
3) Pause Kafka.
4) Wait for ZooKeeper to expire the first session and drop the ephemeral node.
5) Unpause Kafka.
6) Kafka reconnects with ZooKeeper, receives an Expire, and establishes a
second session.
7) Breakpoint hit and event thread paused before handling the first Expire.
8) Pause Kafka again.
9) Wait for ZooKeeper to expire the second session and delete the ephemeral 
node (again).
10) Remove breakpoint, unpause Kafka, and finally release the event thread.
11) Kafka reconnects with ZooKeeper, receives a second Expire, and establishes 
a third session.
12) Kafka registers an ephemeral node triggered by the first expire (which
triggered the second session), but ZooKeeper associates it with the third
session.
13) Kafka tries to register an ephemeral node triggered by the second expire,
but ZooKeeper already has a stable node.
14) Kafka assumes this node will go away soon, sleeps, and then retries.
15) The node is associated with a valid session and therefore does not go away,
so Kafka remains stuck in the retry loop.
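
The sequence above can be replayed without a real cluster. The following is only a minimal sketch, assuming a hypothetical in-memory stand-in for ZooKeeper (FakeZk) and a retry loop capped at a few attempts so that the demo terminates; the real createEphemeralPathExpectConflictHandleZKBug sleeps and retries without a cap. The path /brokers/ids/1 and all names in the block are illustrative.

```scala
import scala.collection.mutable

object StuckRegistrationDemo {
  // Hypothetical in-memory stand-in for ZooKeeper: path -> owning session id.
  final class FakeZk {
    private val ephemerals = mutable.Map.empty[String, Long]
    private var nextSession = 0L

    def newSession(): Long = { nextSession += 1; nextSession }

    // Expiring a session removes every ephemeral node it owns.
    def expire(session: Long): Unit = {
      val owned = ephemerals.collect { case (path, owner) if owner == session => path }.toList
      ephemerals --= owned
    }

    // Returns false when the node already exists (the "node exists" conflict).
    def createEphemeral(path: String, session: Long): Boolean =
      if (ephemerals.contains(path)) false
      else { ephemerals(path) = session; true }

    def owner(path: String): Option[Long] = ephemerals.get(path)
  }

  // Mimics the retry-on-conflict loop, capped so the demo terminates.
  // Returns the attempt that succeeded, or None if every attempt hit a conflict.
  @annotation.tailrec
  def registerWithRetry(zk: FakeZk, path: String, session: Long,
                        maxAttempts: Int, attempt: Int = 1): Option[Int] =
    if (attempt > maxAttempts) None
    else if (zk.createEphemeral(path, session)) Some(attempt)
    else registerWithRetry(zk, path, session, maxAttempts, attempt + 1)

  def run(): (Option[Int], Option[Int], Option[Long]) = {
    val zk = new FakeZk
    val path = "/brokers/ids/1"
    val s1 = zk.newSession()
    zk.createEphemeral(path, s1)
    zk.expire(s1)                                   // first Expire is queued
    val s2 = zk.newSession()
    zk.expire(s2)                                   // second Expire is queued
    val s3 = zk.newSession()                        // third session is the live one
    val first = registerWithRetry(zk, path, s3, 5)  // handler for the first Expire
    val second = registerWithRetry(zk, path, s3, 5) // handler for the second Expire
    (first, second, zk.owner(path))
  }
}
```

In this model the first handler succeeds on its first attempt, while the second never succeeds because the node it collides with belongs to the live third session and will not be cleaned up.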

I have tested this with the latest code in trunk and noted the same behavior 
(the code looks pretty similar).

I have coded up a potential 0.8.1.1 patch for this issue based on the following 
principles:

1) Ensure that stale nodes are removed in main when the node starts
- For Brokers this means remove nodes with the same host name and port,
otherwise fail to start (the existing checker logic)
- For Consumer nodes don't worry about stale nodes - the way they are named
should prevent this from ever happening.
2) In main, add the initial node, which should now always work with no looping
required - a direct call to createEphemeralPath
3) Create an EphemeralNodeMonitor class that contains:
- IZkDataListener
- IZkStateListener
4) The users of this class provide a path to monitor and a closure that
defines what to do when the node is not found
5) When the state listener is notified about a new session it checks to see if
the node is already gone:
- Yes, call the provided function
- No, ignore the event
6) When the data listener is notified of a deletion it does the same thing
7) Both the Broker and Consumer registration use this new class in the same way
they currently use their individual state listeners.  The only change in
behavior is to call createEphemeralPath directly (and avoid the looping code).
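
To make principles 3 to 6 concrete, here is a rough sketch of what such a monitor could look like. It is not the patch itself: the class shape is an assumption, the two methods only stand in for ZkClient's IZkStateListener.handleNewSession and IZkDataListener.handleDataDeleted callbacks, and the node-existence check and the recreate action are passed in as plain closures so the block runs without ZooKeeper.

```scala
object EphemeralNodeMonitorSketch {
  // Hypothetical monitor. nodeExists and recreate are supplied by the caller;
  // in a real integration they would talk to ZooKeeper.
  final class EphemeralNodeMonitor(nodeExists: () => Boolean, recreate: () => Unit) {
    // Stand-in for the state listener's new-session callback.
    def handleNewSession(): Unit = recreateIfMissing()

    // Stand-in for the data listener's deletion callback.
    def handleDataDeleted(): Unit = recreateIfMissing()

    // Both events funnel into the same check, so a node that is still
    // present is left alone instead of being waited on in a loop.
    private def recreateIfMissing(): Unit =
      if (!nodeExists()) recreate()
  }

  // Replays two queued session events followed by a deletion and
  // returns how many times the node was created.
  def run(): Int = {
    var present = false
    var creations = 0
    val monitor = new EphemeralNodeMonitor(
      () => present,
      () => { present = true; creations += 1 }
    )
    monitor.handleNewSession()  // node is gone, so it is created
    monitor.handleNewSession()  // node is present, so the event is ignored
    present = false             // ZooKeeper later deletes the node
    monitor.handleDataDeleted() // node is gone again, so it is recreated
    creations
  }
}
```

The point of the design is that neither callback ever blocks: a redundant session event is a no-op, and a late deletion is picked up by the data listener.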

Since all this work should be done in the event thread I don't think there are 
any race conditions and no other nodes should be adding these nodes (or we have 
a serious configuration issue that should have been detected at startup).

One assumption is that we will always receive at least one more event (expire
and/or delete) after the node is really deleted by ZooKeeper.  I think that is
a valid assumption (ZooKeeper can't send the delete until the node is gone).  I
wonder if we could just get away with monitoring node deletions, but that
seems risky.  The only change in behavior should be that if the expire is
received before the node is actually deleted, then the event loop is not
blocked and could process other messages while waiting for the delete event.

Note: I have not touched the leader election /

[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time

2014-09-28 Thread Jun Rao (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151260#comment-14151260
 ] 

Jun Rao commented on KAFKA-1387:


James,

Thanks for reporting this. Yes, what you discovered is a real problem. The fix
is going to be tricky though. The issue is the following. When a client loses
an ephemeral node in ZK due to session expiration, that ephemeral node is not
removed exactly at expiration time, but a short time after (ZOOKEEPER-1740).
When the client tries to recreate the ephemeral node and gets a
NodeExistException, one of two things could have happened: (1) the existing
node is from the expired session and is on its way to being deleted, or (2) the
node was actually created on the latest session (the reason is what you
discovered: the client gets multiple handleNewSession() calls due to multiple
session expiration events, but the node is created on the latest session). I am
not sure if there is an easy way to distinguish the two cases though.

Overall, it seems to me that there are so many corner cases that one has to 
deal with during ZK session expiration. The simplest approach is probably to 
prevent session expiration from happening at all (e.g., set a larger session 
timeout).

> Kafka getting stuck creating ephemeral node it has already created when two 
> zookeeper sessions are established in a very short period of time
> -
>
> Key: KAFKA-1387
> URL: https://issues.apache.org/jira/browse/KAFKA-1387
> Project: Kafka
>  Issue Type: Bug
>Reporter: Fedor Korotkiy


[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time

2014-09-28 Thread Gwen Shapira (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151264#comment-14151264
 ] 

Gwen Shapira commented on KAFKA-1387:
-

AFAIK the ZK bug was never reproduced in newer versions of ZK. I'm wondering if
at some point we can say that ZK 3.3 is no longer supported and remove the
work-around (which is creating a few issues of its own).


> Kafka getting stuck creating ephemeral node it has already created when two 
> zookeeper sessions are established in a very short period of time
> -
>
> Key: KAFKA-1387
> URL: https://issues.apache.org/jira/browse/KAFKA-1387
> Project: Kafka
>  Issue Type: Bug
>Reporter: Fedor Korotkiy


[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time

2014-09-28 Thread Jun Rao (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151266#comment-14151266
 ] 

Jun Rao commented on KAFKA-1387:


Gwen,

From ZOOKEEPER-1809, it seems the design of not deleting the ephemeral node
immediately on session expiration still exists on ZK 3.4.x and beyond?

> Kafka getting stuck creating ephemeral node it has already created when two 
> zookeeper sessions are established in a very short period of time
> -
>
> Key: KAFKA-1387
> URL: https://issues.apache.org/jira/browse/KAFKA-1387
> Project: Kafka
>  Issue Type: Bug
>Reporter: Fedor Korotkiy


[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time

2014-09-28 Thread Gwen Shapira (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151275#comment-14151275
 ] 

Gwen Shapira commented on KAFKA-1387:
-

ZOOKEEPER-1809 was closed because the re-creation of the issue was buggy (the
test app was actually creating two sessions at the same time).

I agree that Flavio indicated that ZNodes can hang around after expiration, but
he also indicated the opposite in the email thread for ZOOKEEPER-1740.

It's important to get this right, so I'll do more research on the expected
ZooKeeper behavior here.

One thing I'm not sure about is why
createEphemeralPathExpectConflictHandleZKBug loops indefinitely.
If ZK indeed takes a bit of extra time to clean up, we can loop for a specific
amount of time (number of retries), like Curator typically does. After a few
seconds, the probability that the ZNode belongs to an active session and not an
expired one is very high.
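
A bounded loop of the kind suggested here could look roughly like the sketch below. It does not reproduce Curator's retry policies or any existing Kafka code; maxRetries and backoffMs are made-up parameters, and tryCreate is a caller-supplied closure standing in for the actual create call.

```scala
object BoundedRetrySketch {
  sealed trait Outcome
  case object Created extends Outcome
  case object GaveUp extends Outcome

  // tryCreate returns true once the node could be created.
  // maxRetries and backoffMs are illustrative parameters, not Kafka settings.
  @annotation.tailrec
  def createWithBoundedRetry(tryCreate: () => Boolean, maxRetries: Int, backoffMs: Long): Outcome =
    if (tryCreate()) Created
    else if (maxRetries <= 0) GaveUp
    else {
      Thread.sleep(backoffMs) // give a stale node time to be cleaned up
      createWithBoundedRetry(tryCreate, maxRetries - 1, backoffMs)
    }

  // Models a stale node that disappears after `staleFor` failed attempts.
  def staleNode(staleFor: Int): () => Boolean = {
    var failuresLeft = staleFor
    () => if (failuresLeft > 0) { failuresLeft -= 1; false } else true
  }
}
```

With a stale node that clears after two failed attempts, createWithBoundedRetry(staleNode(2), 3, 1L) ends in Created; with a node that never goes away it ends in GaveUp after the initial attempt plus three retries, and the caller then has to decide what a conflict with a live session means.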

> Kafka getting stuck creating ephemeral node it has already created when two 
> zookeeper sessions are established in a very short period of time
> -
>
> Key: KAFKA-1387
> URL: https://issues.apache.org/jira/browse/KAFKA-1387
> Project: Kafka
>  Issue Type: Bug
>Reporter: Fedor Korotkiy


[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time

2014-09-29 Thread James Lent (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151663#comment-14151663
 ] 

James Lent commented on KAFKA-1387:
---

As background we are using ZooKeeper 3.4.5.

When trying to come up with a fix for this I did consider limiting the loop to 
2 to 3 tries.  My concerns with this approach were:

# Slow to recover if there are lots of Expire messages to process and each of 
these could trigger redundant rebalance events until you get to the last one.
# What happens if you don't loop quite long enough?  You are again stuck in a 
bad state when the ephemeral does go away.

I also considered trying to access the Session Id and storing that value 
instead of (or in addition to) the timestamp in the node's data.  That approach 
looked difficult to implement, error prone, and had the application doing what 
I would consider ZooKeeper work.
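
For reference, a rough sketch of what a session-id check could look like. It 
is not the approach taken in the patch. ZooKeeper itself records the owning 
session of an ephemeral node (Stat.getEphemeralOwner()) and the raw client 
exposes ZooKeeper.getSessionId(); how to get at either through ZkClient is 
left out here. NodeInfo, Action and decide are hypothetical names used only 
for illustration:

```scala
object EphemeralOwnership {
  /** Minimal stand-in for the part of a znode's Stat used here (hypothetical type). */
  final case class NodeInfo(ephemeralOwner: Long)

  sealed trait Action
  case object KeepNode extends Action       // node belongs to our current session
  case object WaitForDelete extends Action  // node belongs to another (possibly expired) session
  case object CreateNode extends Action     // node is absent

  /** Decides what to do on a new session, given the node's owner (if the node exists). */
  def decide(currentSessionId: Long, existing: Option[NodeInfo]): Action =
    existing match {
      case None => CreateNode
      case Some(n) if n.ephemeralOwner == currentSessionId => KeepNode
      case Some(_) => WaitForDelete
    }
}
```

The comparison itself is trivial; the difficulty mentioned above is in 
obtaining both values reliably at the time the callback runs.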

I agree there are a lot of corner cases to consider, but I think we are going 
to pursue the approach I outlined above.  I would be happy to post the proposed 
solution for your review, but again I am not sure about the protocol around 
patch submission.  I would not want this to be mistaken by someone as any kind 
of official patch without a lot more review.

When working on this approach I looked at the curator PersistentEphemeralNode 
for ideas:

https://github.com/bazaarvoice/curator-extensions/blob/master/recipes/src/main/java/com/bazaarvoice/curator/recipes/PersistentEphemeralNode.java

This is curator based so does not directly apply to Kafka (yet), but it also 
keys off nodeDelete to restore the node.

In the end I went with the simple idea that:

"If, when we process an Expire event, the node still exists, then ZooKeeper will 
inform us if that node later goes away."

If we can't trust ZooKeeper/ZkClient to do that then ...

{noformat}
  class StateListener() extends IZkStateListener {

    def handleStateChanged(state: KeeperState) {}

    def handleNewSession() {
      if (zkClient.exists(path)) {
        // Node from the old session is still there; rely on the delete watch.
        info("New session started, but, ephemeral %s already/still exists".format(path))
      }
      else {
        info("New session started, recreate ephemeral node %s".format(path))
        recreateNode()
      }
    }
  }
{noformat}



[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time

2014-09-29 Thread James Lent (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151673#comment-14151673
 ] 

James Lent commented on KAFKA-1387:
---

In case anyone is interested in the complete code for the new class I am 
testing with:

{noformat}
// Imports added for completeness (the log output shows the class in kafka.utils).
import kafka.utils.Logging
import org.I0Itec.zkclient.{IZkDataListener, IZkStateListener, ZkClient}
import org.apache.zookeeper.Watcher.Event.KeeperState

class EphemeralNodeMonitor(zkClient: ZkClient, path: String, recreateNode: () => Unit) extends Logging {

  val dataListener = new DataListener
  val stateListener = new StateListener

  // Watch both session state changes and changes to the ephemeral node itself.
  def start() {
    zkClient.subscribeStateChanges(stateListener)
    zkClient.subscribeDataChanges(path, dataListener)
  }

  def close() {
    zkClient.unsubscribeStateChanges(stateListener)
    zkClient.unsubscribeDataChanges(path, dataListener)
  }

  class DataListener extends IZkDataListener {

    var oldData: String = null

    def handleDataChange(dataPath: String, newData: scala.Any) {
      if (!newData.toString.equals(oldData)) {
        oldData = newData.toString
        info("Ephemeral node %s has new data [%s]".format(dataPath, newData))
      }
    }

    // Recreate the node only if nobody has recreated it since the delete.
    def handleDataDeleted(dataPath: String) {
      if (zkClient.exists(path)) {
        info("Ephemeral node %s was deleted, but, has already been recreated".format(dataPath))
      }
      else {
        info("Ephemeral node %s was deleted, recreate it".format(dataPath))
        recreateNode()
      }
    }
  }

  class StateListener() extends IZkStateListener {

    def handleStateChanged(state: KeeperState) {}

    // If the old node still exists, leave it; the delete watch will fire later.
    def handleNewSession() {
      if (zkClient.exists(path)) {
        info("New session started, but, ephemeral %s already/still exists".format(path))
      }
      else {
        info("New session started, recreate ephemeral node %s".format(path))
        recreateNode()
      }
    }
  }
}
{noformat}



[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time

2014-09-29 Thread Jun Rao (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151873#comment-14151873
 ] 

Jun Rao commented on KAFKA-1387:


James,

Contributing code to Kafka is pretty simple. You just need to attach a patch to 
the jira.

As for your solution, we probably need to verify the following: will a watcher 
fire if it's registered on a path that was created by an already expired 
session and that will be deleted soon?



[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time

2014-09-29 Thread James Lent (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152337#comment-14152337
 ] 

James Lent commented on KAFKA-1387:
---

I apologize in advance for my ignorance, but I have one newbie question.  My 
starting point is the 0.8.1.1 tag (really the 0.8.1.1 source distribution).  
Would it be OK for me to submit a patch against that baseline or would it be 
better for me to first merge the code to trunk and then create the patch?



[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time

2014-09-29 Thread James Lent (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152453#comment-14152453
 ] 

James Lent commented on KAFKA-1387:
---

As for your question (which I agree is one of the key questions) I have the 
following comments:

* The ZooKeeper documentation states there is one case where a watch may be 
missed which I do not think applies to the situation I am trying to address:

"Watches are maintained locally at the ZooKeeper server to which the client is 
connected. This allows watches to be lightweight to set, maintain, and 
dispatch. When a client connects to a new server, the watch will be triggered 
for any session events. Watches will not be received while disconnected from a 
server. When a client reconnects, any previously registered watches will be 
reregistered and triggered if needed. In general this all occurs transparently. 
There is one case where a watch may be missed: a watch for the existence of a 
znode not yet created will be missed if the znode is created and deleted while 
disconnected."

* In my testing the node is normally gone by the time the New Session event is 
handled, which recreates the node. In that case I do not see a Delete message (I 
log the arrival of a delete event even if the node is already gone):

{noformat}
[2014-09-29 18:23:43,071] INFO zookeeper state changed (Expired) 
(org.I0Itec.zkclient.ZkClient)
[2014-09-29 18:23:43,071] INFO Unable to reconnect to ZooKeeper service, 
session 0x148c36a0a94000f has expired, closing socket connection 
(org.apache.zookeeper.ClientCnxn)
[2014-09-29 18:23:43,071] INFO Initiating client connection, 
connectString=localhost:2181/kafka/0.8 sessionTimeout=6000 
watcher=org.I0Itec.zkclient.ZkClient@56404645 (org.apache.zookeeper.ZooKeeper)
[2014-09-29 18:23:43,072] INFO Opening socket connection to server 
localhost/127.0.0.1:2181 (org.apache.zookeeper.ClientCnxn)
[2014-09-29 18:23:43,073] INFO Socket connection established to 
localhost/127.0.0.1:2181, initiating session (org.apache.zookeeper.ClientCnxn)
[2014-09-29 18:23:43,074] INFO EventThread shut down 
(org.apache.zookeeper.ClientCnxn)
[2014-09-29 18:23:43,082] INFO Closing socket connection to /10.210.10.165. 
(kafka.network.Processor)
[2014-09-29 18:23:43,087] INFO Session establishment complete on server 
localhost/127.0.0.1:2181, sessionid = 0x148c36a0a940010, negotiated timeout = 
6000 (org.apache.zookeeper.ClientCnxn)
[2014-09-29 18:23:43,087] INFO zookeeper state changed (SyncConnected) 
(org.I0Itec.zkclient.ZkClient)
[2014-09-29 18:23:43,099] INFO 0 successfully elected as leader 
(kafka.server.ZookeeperLeaderElector)
[2014-09-29 18:23:43,143] INFO New session started, recreate ephemeral node 
/brokers/ids/0 (kafka.utils.EphemeralNodeMonitor)
[2014-09-29 18:23:43,144] INFO Start registering broker 0 in ZooKeeper 
(kafka.server.KafkaHealthcheck)
[2014-09-29 18:23:43,161] INFO Registered broker 0 at path /brokers/ids/0 with 
address jlent.digitalsmiths.com:9092. (kafka.utils.ZkUtils$)
[2014-09-29 18:23:43,218] INFO Ephemeral node /brokers/ids/0 has new data 
[{"jmx_port":10001,"timestamp":"1412029423148","host":"jlent.digitalsmiths.com","version":1,"port":9092}]
 (kafka.utils.EphemeralNodeMonitor)
[2014-09-29 18:23:43,237] INFO New leader is 0 
(kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
{noformat}

* I have seen cases where the node is still present when the New Session is 
handled and in that case I do see a Delete event a short while later.  I don't 
have the logs that document that (don't ask me why I don't have logs to 
document the most important scenario).  I will try to recreate that situation.
* As an alternative I modified the New Session handling code to do nothing 
(except log the arrival of the new session event).  In that case I do see the 
Delete event.  This could perhaps be viewed as a more severe test.  In this case 
we get notified of a Delete that actually occurred before we even handled the 
New Session event.  That was actually how I did some of my original testing.

{noformat}
[2014-09-29 18:14:31,414] INFO zookeeper state changed (Expired) 
(org.I0Itec.zkclient.ZkClient)
[2014-09-29 18:14:31,414] INFO Unable to reconnect to ZooKeeper service, 
session 0x148c36a0a94000c has expired, closing socket connection 
(org.apache.zookeeper.ClientCnxn)
[2014-09-29 18:14:31,414] INFO Initiating client connection, 
connectString=localhost:2181/kafka/0.8 sessionTimeout=6000 
watcher=org.I0Itec.zkclient.ZkClient@15c58840 (org.apache.zookeeper.ZooKeeper)
[2014-09-29 18:14:31,414] INFO Opening socket connection to server 
localhost/127.0.0.1:2181 (org.apache.zookeeper.ClientCnxn)
[2014-09-29 18:14:31,415] INFO EventThread shut down 
(org.apache.zookeeper.ClientCnxn)
[2014-09-29 18:14:31,415] INFO Socket connection established to 
localhost/127.0.0.1:2181, initiating session (org.apache.zookeeper.ClientCnxn)
[2014-09-29 18:14:31,420] INFO 

[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time

2014-09-30 Thread James Lent (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14153370#comment-14153370
 ] 

James Lent commented on KAFKA-1387:
---

I have messed things up.  I tried to use the Submit Patch option.  I filled out 
the fields in the form, but it never asked me for a file.  I also specified 
labels that I assumed were related to the patch, but instead they are associated 
with the issue itself.  I then directly attached the file to the issue.  That 
seemed to go OK.  Now the Submit Patch option is gone and the Status is Patch 
Available.  I don't think that is correct.  I decided it is best if I stop 
messing with the issue for now.  I have done enough damage.

I apologize for my ignorance of the process.

> Kafka getting stuck creating ephemeral node it has already created when two 
> zookeeper sessions are established in a very short period of time
> -
>
> Key: KAFKA-1387
> URL: https://issues.apache.org/jira/browse/KAFKA-1387
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.8.1.1
>Reporter: Fedor Korotkiy
>  Labels: newbie, patch
> Attachments: kafka-1387.patch
>
>
> Kafka broker re-registers itself in zookeeper every time handleNewSession() 
> callback is invoked.
> https://github.com/apache/kafka/blob/0.8.1/core/src/main/scala/kafka/server/KafkaHealthcheck.scala
>  
> Now imagine the following sequence of events.
> 1) Zookeeper session reestablishes. handleNewSession() callback is queued by 
> the zkClient, but not invoked yet.
> 2) Zookeeper session reestablishes again, queueing callback second time.
> 3) First callback is invoked, creating /broker/[id] ephemeral path.
> 4) Second callback is invoked and it tries to create /broker/[id] path using 
> createEphemeralPathExpectConflictHandleZKBug() function. But the path is 
> already exists, so createEphemeralPathExpectConflictHandleZKBug() is getting 
> stuck in the infinite loop.
> Seems like controller election code have the same issue.
> I'am able to reproduce this issue on the 0.8.1 branch from github using the 
> following configs.
> # zookeeper
> tickTime=10
> dataDir=/tmp/zk/
> clientPort=2101
> maxClientCnxns=0
> # kafka
> broker.id=1
> log.dir=/tmp/kafka
> zookeeper.connect=localhost:2101
> zookeeper.connection.timeout.ms=100
> zookeeper.sessiontimeout.ms=100
> Just start kafka and zookeeper and then pause zookeeper several times using 
> Ctrl-Z.





[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time

2014-10-02 Thread Jun Rao (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14156635#comment-14156635
 ] 

Jun Rao commented on KAFKA-1387:


James,

For my question, could you ask the ZK mailing list and get your understanding 
confirmed by their developers? Thanks,



[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time

2014-10-02 Thread James Lent (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14156746#comment-14156746
 ] 

James Lent commented on KAFKA-1387:
---

Good idea and done:

http://mail-archives.apache.org/mod_mbox/zookeeper-user/201410.mbox/browser



[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time

2014-04-11 Thread Guozhang Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1394#comment-1394
 ] 

Guozhang Wang commented on KAFKA-1387:
--

Hi Fedor, do you think this is caused by the same issue described in 
https://issues.apache.org/jira/browse/KAFKA-1382 ?




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time

2014-04-13 Thread Fedor Korotkiy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13967841#comment-13967841
 ] 

Fedor Korotkiy commented on KAFKA-1387:
---

I think it's a different issue.



[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time

2014-04-13 Thread Guozhang Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13967928#comment-13967928
 ] 

Guozhang Wang commented on KAFKA-1387:
--

I think the main issue here is that when there is a zookeeper session timeout, 
the zkClient will retry writing data that could already have been committed to 
ZK, and the retry fails. This issue is the same as the one causing KAFKA-1382, 
but I think their fixes would be different.
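
To illustrate the retry problem described above, here is a small in-memory 
sketch. It does not use ZkClient or model sessions; FakeStore, writeWithRetry 
and the ackLost flag are hypothetical and only show how a write that was 
committed, but never acknowledged, makes the retry fail with a conflict against 
the writer's own data:

```scala
import scala.collection.mutable

object LostAckDemo {
  /** Tiny in-memory stand-in for a store with create-if-absent semantics. */
  final class FakeStore {
    private val nodes = mutable.Map.empty[String, String]

    /** Returns true if the node was created, false if the path already exists. */
    def create(path: String, data: String): Boolean =
      if (nodes.contains(path)) false
      else { nodes(path) = data; true }

    def read(path: String): Option[String] = nodes.get(path)
  }

  /**
   * Writes `data` at `path`. If the first create succeeded but its
   * acknowledgement was lost (`ackLost`), the write is retried once.
   */
  def writeWithRetry(store: FakeStore, path: String, data: String, ackLost: Boolean): String = {
    val first = store.create(path, data)
    if (first && !ackLost) "created"
    else if (store.create(path, data)) "created on retry"
    else if (store.read(path).contains(data)) "conflict with own earlier write"
    else "conflict with another writer"
  }
}
```

Telling the last two cases apart by comparing data is only an illustration; 
for ephemeral nodes the stored data alone does not say which session owns the 
node.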



[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time

2014-08-04 Thread Joe Stein (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14085792#comment-14085792
 ] 

Joe Stein commented on KAFKA-1387:
--

Here is another way to reproduce this issue. I have seen it a few times now 
with folks getting started with their clusters.

Steps to reproduce: set up a 3-node ZK ensemble with a 3-broker cluster.

e.g. 

git clone https://github.com/stealthly/scala-kafka
cd scala-kafka
git checkout -b zkbk3 origin/zkbk3
vagrant up --provider=virtualbox

Now set up each node in the cluster as you would for brokers 1, 2, 3 and the ensemble.

e.g.

vagrant ssh zkbkOne
sudo su
cd /vagrant/vagrant/ && ./up.sh
vagrant ssh zkbkTwo
sudo su
cd /vagrant/vagrant/ && ./up.sh
vagrant ssh zkbkThree
sudo su
cd /vagrant/vagrant/ && ./up.sh

Start up ZooKeeper on all 3 nodes:
cd /opt/apache/kafka && bin/zookeeper-server-start.sh config/zookeeper.properties 1>>/tmp/zk.log 2>>/tmp/zk.log &

Now start up the broker on node 2 only:
cd /opt/apache/kafka && bin/kafka-server-start.sh config/server.properties 1>>/tmp/bk.log 2>>/tmp/bk.log &

OK, now here is where it gets wonky.

- Change broker.id on server 3 to 2.
Now start the brokers on servers 1 and 3 (the latter now has broker.id 2) at the 
same time:

cd /opt/apache/kafka && bin/kafka-server-start.sh config/server.properties 1>>/tmp/bk.log 2>>/tmp/bk.log &
cd /opt/apache/kafka && bin/kafka-server-start.sh config/server.properties 1>>/tmp/bk.log 2>>/tmp/bk.log &
(Two terminal tabs work: hit enter in one, switch to the other tab, and hit 
enter; that is close enough to the same time.)

and you get this looping forever

[2014-08-05 04:34:38,591] INFO I wrote this conflicted ephemeral node 
[{"version":1,"brokerid":2,"timestamp":"1407212148186"}] at /controller a while 
back in a different session, hence I will backoff for this node to be deleted 
by Zookeeper and retry (kafka.utils.ZkUtils$)
[2014-08-05 04:34:44,598] INFO conflict in /controller data: 
{"version":1,"brokerid":2,"timestamp":"1407212148186"} stored data: 
{"version":1,"brokerid":2,"timestamp":"1407211911014"} (kafka.utils.ZkUtils$)
[2014-08-05 04:34:44,601] INFO I wrote this conflicted ephemeral node 
[{"version":1,"brokerid":2,"timestamp":"1407212148186"}] at /controller a while 
back in a different session, hence I will backoff for this node to be deleted 
by Zookeeper and retry (kafka.utils.ZkUtils$)
[2014-08-05 04:34:50,610] INFO conflict in /controller data: 
{"version":1,"brokerid":2,"timestamp":"1407212148186"} stored data: 
{"version":1,"brokerid":2,"timestamp":"1407211911014"} (kafka.utils.ZkUtils$)
[2014-08-05 04:34:50,614] INFO I wrote this conflicted ephemeral node 
[{"version":1,"brokerid":2,"timestamp":"1407212148186"}] at /controller a while 
back in a different session, hence I will backoff for this node to be deleted 
by Zookeeper and retry (kafka.utils.ZkUtils$)
[2014-08-05 04:34:56,621] INFO conflict in /controller data: 
{"version":1,"brokerid":2,"timestamp":"1407212148186"} stored data: 
{"version":1,"brokerid":2,"timestamp":"1407211911014"} (kafka.utils.ZkUtils$)
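
The log above suggests why the loop never ends: the stored /controller data and the new data differ only in timestamp, so a check based on broker id alone concludes "I wrote this". A minimal sketch of that decision, assuming the check compares only the brokerid field (this is inferred from the log messages, and same_broker_id and classify_conflict are hypothetical helper names, not Kafka functions):

```python
import json


def same_broker_id(stored_json, my_json):
    """Hypothetical checker: the node is 'mine' if the brokerid fields match."""
    return json.loads(stored_json)["brokerid"] == json.loads(my_json)["brokerid"]


def classify_conflict(stored_json, my_json):
    """Return what a checker-based retry loop would decide for a conflicting node."""
    if same_broker_id(stored_json, my_json):
        # Assumed to be left over from our own earlier session: wait and retry.
        return "backoff-and-retry"
    # Someone else owns the node: give up instead of looping.
    return "fail"


# Payloads copied from the log lines above.
mine = '{"version":1,"brokerid":2,"timestamp":"1407212148186"}'
stored = '{"version":1,"brokerid":2,"timestamp":"1407211911014"}'

decision = classify_conflict(stored, mine)
written_at_different_times = (
    json.loads(stored)["timestamp"] != json.loads(mine)["timestamp"]
)
```

With two live brokers sharing broker id 2, the stored node belongs to the other broker's live session and is never deleted, so a "backoff-and-retry" decision repeats indefinitely.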

The expected result is:

[2014-08-05 04:07:20,917] INFO conflict in /brokers/ids/2 data: 
{"jmx_port":-1,"timestamp":"1407211640900","host":"192.168.30.3","version":1,"port":9092} 
stored data: 
{"jmx_port":-1,"timestamp":"1407211199464","host":"192.168.30.2","version":1,"port":9092} 
(kafka.utils.ZkUtils$)
[2014-08-05 04:07:20,949] FATAL Fatal error during KafkaServerStable startup. 
Prepare to shutdown (kafka.server.KafkaServerStartable)
java.lang.RuntimeException: A broker is already registered on the path 
/brokers/ids/2. This probably indicates that you either have configured a 
brokerid that is already in use, or else you have shutdown 
this broker and restarted it faster than the zookeeper timeout so it appears to 
be re-registering.
at kafka.utils.ZkUtils$.registerBrokerInZk(ZkUtils.scala:205)
at kafka.server.KafkaHealthcheck.register(KafkaHealthcheck.scala:57)
at kafka.server.KafkaHealthcheck.startup(KafkaHealthcheck.scala:44)
at kafka.server.KafkaServer.startup(KafkaServer.scala:103)
at kafka.server.KafkaServerStartable.startup(KafkaServerStartable.scala:34)
at kafka.Kafka$.main(Kafka.scala:46)
at kafka.Kafka.main(Kafka.scala)
[2014-08-05 04:07:20,952] INFO [Kafka Server 2], shutting down 
(kafka.server.KafkaServer)
[2014-08-05 04:07:20,954] INFO [Socket Server on Broker 2], Shutting down 
(kafka.network.SocketServer)
[2014-08-05 04:07:20,959] INFO [Socket Server on Broker 2], Shutdown completed 
(kafka.network.SocketServer)
[2014-08-05 04:07:20,960] INFO [Kafka Request Handler on Broker 2], shutting 
down (kafka.server.KafkaRequestHandlerPool)
[2014-08-05 04:07:20,992] INFO [Kafka Request Handler on Broker 2], shut down 
completely (kafka.server.KafkaRequestHandlerPool)
[2014-08-05 04:07:21,263] INFO [Replica Manager on Broker 2]: Shut down 
(kafka.server.ReplicaManager)
[

[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time

2014-08-05 Thread Jun Rao (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14086398#comment-14086398
 ] 

Jun Rao commented on KAFKA-1387:


Joe,

The issue that you described is probably fixed in KAFKA-1451?



[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time

2014-08-05 Thread Joe Stein (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14087063#comment-14087063
 ] 

Joe Stein commented on KAFKA-1387:
--

[~junrao] I tested on trunk and it is much worse now.

Instead of looping on the /controller node (as it did before), node 3 
actually overwrote/stole /brokers/ids/2 (doing a get before had it as 
192.168.30.1 and after it is 192.168.30.1).

So now I have a situation where two broker servers are running with the same 
broker id. Node 3 is the broker on which all the topics are being created, and 
it is failing produce and consume requests (because all the data is on 
node 1 but that is not advertised), and node 1 is still the controller.








[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time

2014-08-06 Thread Gwen Shapira (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088474#comment-14088474
 ] 

Gwen Shapira commented on KAFKA-1387:
-

Attempted to reproduce with trunk as well.

I'm not seeing the same behavior as [~joestein]. 
In my experiment the new broker 2 fails with the correct error message. The old 
broker 2, OTOH, goes into a loop, printing:
"[2014-08-06 16:37:01,884] INFO Partition [test1,0] on broker 2: Cached 
zkVersion [89] not equal to that in zookeeper, skip updating ISR 
(kafka.cluster.Partition)"

Not a good behavior either. 
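
The "Cached zkVersion not equal to that in zookeeper" message reads like a version-conditioned write that keeps being attempted with a stale version. A minimal sketch of that pattern, using an in-memory stand-in rather than a real ZooKeeper client (VersionedNode and the "isr=..." strings are made up for illustration, and this does not claim to show how Kafka should resolve it):

```python
class VersionedNode:
    """In-memory stand-in for a znode with a version counter."""

    def __init__(self, data):
        self.data = data
        self.version = 0

    def conditional_update(self, new_data, expected_version):
        """Write only if expected_version matches; return (ok, current_version)."""
        if expected_version != self.version:
            return False, self.version
        self.data = new_data
        self.version += 1
        return True, self.version


node = VersionedNode("isr=[1,2]")
cached_version = node.version  # the old broker caches version 0

# Another writer updates the node behind the old broker's back.
node.conditional_update("isr=[1]", node.version)

# The old broker retries with its stale cached version; every attempt is skipped.
results = [node.conditional_update("isr=[2]", cached_version)[0] for _ in range(3)]

# Only after re-reading the current version does the write go through.
ok_after_refresh, _ = node.conditional_update("isr=[2]", node.version)
```

As long as the cached version is never refreshed, each attempt is rejected in the same way, which matches the repeating log line.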



[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time

2014-08-10 Thread Jun Rao (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092255#comment-14092255
 ] 

Jun Rao commented on KAFKA-1387:


Hmm, this seems really weird. I am not sure why starting two brokers at the same 
time would affect the ZK registration. Is this reproducible by running multiple 
brokers on the same machine?
