[ 
https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697907#comment-14697907
 ] 

Guozhang Wang commented on KAFKA-1387:
--------------------------------------

Thanks [~fpj], this is very helpful.

To add some context on this issue: we have seen cases where ephemeral nodes 
were not deleted when brokers / consumers tried to re-register themselves in ZK 
after a session timeout event (details can be found in KAFKA-992). We tried to 
fix this by adding a registration timestamp to the registration node's data, 
checking whether the timestamp is different upon seeing an existing node, and, 
if not, backing off to wait for that node to be removed.

However, people have also reported a couple of times that the backing-off 
never ends, i.e. the node has a different timestamp but is never deleted. The 
suspicion is that multiple consecutive sessions were created within a very 
short period of time, and that the node with a different timestamp was created 
by a session that has not actually expired and hence will never go away. No 
one has validated whether this is the case, though.
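
To make the suspected failure mode concrete, here is a minimal Python sketch of 
a back-off loop of this kind. It is not the Kafka code: FakeZk, register, the 
backoff hook, max_retries and the same_broker check are all hypothetical 
stand-ins, and the exact comparison Kafka performs on the node data may differ. 
The sketch only shows that backing off works when the conflicting node belongs 
to an expired session, and cannot terminate when it belongs to a live one.

```python
class NodeExistsError(Exception):
    """Raised when an ephemeral node already exists at the path."""


class FakeZk:
    """Hypothetical in-memory stand-in for ZooKeeper (not a real client)."""

    def __init__(self):
        self.nodes = {}  # path -> (owner_session, data)

    def create_ephemeral(self, session, path, data):
        if path in self.nodes:
            raise NodeExistsError(path)
        self.nodes[path] = (session, data)

    def expire(self, session):
        # Expiring a session removes every ephemeral node it owns.
        self.nodes = {p: v for p, v in self.nodes.items() if v[0] != session}


def register(zk, session, path, data, checker, backoff, max_retries=5):
    """Create the node; on conflict, back off while `checker` says the
    existing node looks like our own earlier registration.

    Returns the number of back-offs, or None if the retries ran out.
    `max_retries` is an assumption added so the sketch terminates; the
    issue describes a loop with no such bound.
    """
    for attempt in range(max_retries + 1):
        try:
            zk.create_ephemeral(session, path, data)
            return attempt
        except NodeExistsError:
            _, existing = zk.nodes[path]
            if not checker(existing, data):
                raise  # someone else's node: a genuine conflict
            backoff()  # a real client would sleep here before retrying
    return None


# Assumed check: the node was written by the same broker.
same_broker = lambda existing, mine: existing["broker"] == mine["broker"]

# Case 1: the old node belongs to session 1, which expires during back-off.
zk = FakeZk()
zk.create_ephemeral(1, "/broker/1", {"broker": 1, "timestamp": 100})
stale = register(zk, 2, "/broker/1", {"broker": 1, "timestamp": 200},
                 same_broker, backoff=lambda: zk.expire(1))

# Case 2: the node was created by session 2 itself, which is still alive,
# so no amount of waiting removes it.
zk2 = FakeZk()
zk2.create_ephemeral(2, "/broker/1", {"broker": 1, "timestamp": 200})
stuck = register(zk2, 2, "/broker/1", {"broker": 1, "timestamp": 201},
                 same_broker, backoff=lambda: None)
```

In case 1 the registration succeeds after one back-off; in case 2 the retries 
run out, which corresponds to the never-ending back-off people have reported.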

The logic of re-registration can be found in ZookeeperConsumerConnector.scala 
and KafkaHealthcheck.scala.

> Kafka getting stuck creating ephemeral node it has already created when two 
> zookeeper sessions are established in a very short period of time
> ---------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-1387
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1387
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8.1.1
>            Reporter: Fedor Korotkiy
>            Priority: Blocker
>              Labels: newbie, patch, zkclient-problems
>         Attachments: kafka-1387.patch
>
>
> Kafka broker re-registers itself in zookeeper every time handleNewSession() 
> callback is invoked.
> https://github.com/apache/kafka/blob/0.8.1/core/src/main/scala/kafka/server/KafkaHealthcheck.scala
>  
> Now imagine the following sequence of events.
> 1) Zookeeper session reestablishes. handleNewSession() callback is queued by 
> the zkClient, but not invoked yet.
> 2) Zookeeper session reestablishes again, queueing callback second time.
> 3) First callback is invoked, creating /broker/[id] ephemeral path.
> 4) Second callback is invoked and it tries to create /broker/[id] path using 
> createEphemeralPathExpectConflictHandleZKBug() function. But the path 
> already exists, so createEphemeralPathExpectConflictHandleZKBug() gets 
> stuck in an infinite loop.
> Seems like the controller election code has the same issue.
> I am able to reproduce this issue on the 0.8.1 branch from github using the 
> following configs.
> # zookeeper
> tickTime=10
> dataDir=/tmp/zk/
> clientPort=2101
> maxClientCnxns=0
> # kafka
> broker.id=1
> log.dir=/tmp/kafka
> zookeeper.connect=localhost:2101
> zookeeper.connection.timeout.ms=100
> zookeeper.sessiontimeout.ms=100
> Just start kafka and zookeeper and then pause zookeeper several times using 
> Ctrl-Z.
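
The four-step sequence in the quoted description can be replayed with a small 
Python simulation. This is only a sketch under assumptions: FakeZkClient and 
its methods are hypothetical and do not mirror the zkClient API, and the sketch 
assumes that a queued callback runs under whichever session is current at the 
time it is invoked.

```python
from collections import deque


class FakeZkClient:
    """Hypothetical stand-in for a client that queues session callbacks."""

    def __init__(self):
        self.session_id = 0
        self.pending = deque()  # callbacks queued but not yet invoked
        self.nodes = {}         # path -> id of the session that created it
        self.log = []

    def new_session(self):
        # Steps 1 and 2: a session is established and the callback is queued.
        self.session_id += 1
        self.pending.append(self.handle_new_session)

    def handle_new_session(self):
        # Steps 3 and 4: the callback runs under the current session.
        path = "/broker/1"
        if path in self.nodes:
            self.log.append(("conflict", self.nodes[path], self.session_id))
        else:
            self.nodes[path] = self.session_id
            self.log.append(("created", self.session_id))

    def drain(self):
        # Invoke the queued callbacks in the order they were queued.
        while self.pending:
            self.pending.popleft()()


client = FakeZkClient()
client.new_session()  # step 1
client.new_session()  # step 2
client.drain()        # steps 3 and 4
```

After the run, client.log records that the first callback created the node 
under session 2 and that the second callback then hit a conflict with a node 
owned by that same, still-live session. Waiting for such a node to be deleted 
cannot succeed, which matches the reported hang.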



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
