[ 
https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14699514#comment-14699514
 ] 

Flavio Junqueira commented on KAFKA-1387:
-----------------------------------------

There are two problems described here at a high level: ZK losing ephemerals 
and ephemerals not going away. I haven't been able to reproduce the former, but 
I've been able to find one potential problem that could be causing it.

I started by finding it suspicious that the ZK listeners were not dealing with 
session events at all:

{code}
def handleStateChanged(state: KeeperState) {
      // do nothing, since zkclient will do reconnect for us.
}
{code}

It is quite typical with ZK that you wait for the connected event before 
making progress. Looking at the ZkClient implementation, I realized that it 
retries operations in the case of connection loss or session expiration until 
they go through. There is a race here, though. Say you submit a create, but 
instead of getting OK as a response, you get connection loss. The create may 
still have been applied on the server. ZkClient in this case will say "well, 
need to retry", and the retry will get a node exists exception, which the code 
currently treats as a znode from a previous session. This znode will never go 
away because it belongs to the current session!
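
One way to tell the two cases apart is to compare the owner of the existing 
znode with the id of the current session. The ZooKeeper Java API exposes both 
(Stat.getEphemeralOwner() and ZooKeeper.getSessionId()). The sketch below only 
models that decision. It does not talk to ZooKeeper, all names in it are 
hypothetical, and how to obtain the session id through ZkClient is an 
assumption left out here.

```scala
// Minimal sketch with hypothetical names: decide what a "node exists" error
// means for an ephemeral znode. Nothing here calls ZooKeeper or ZkClient.
object EphemeralConflict {
  sealed trait Resolution

  // The znode was created by our own live session, e.g. by an earlier create
  // that was applied on the server but whose reply was lost. It will not go
  // away while we are alive, so waiting for it to disappear would never end.
  case object AlreadyRegistered extends Resolution

  // The znode belongs to a different session. It should disappear once that
  // session expires, so backing off and retrying is reasonable.
  case object WaitForOldSession extends Resolution

  // ephemeralOwner: session id stored with the existing znode
  // currentSessionId: id of the session the client holds right now
  def resolve(ephemeralOwner: Long, currentSessionId: Long): Resolution =
    if (ephemeralOwner == currentSessionId) AlreadyRegistered
    else WaitForOldSession
}
```

With a check like this, the retry after a connection loss would see its own 
session id on the znode and stop, instead of treating the znode as a leftover 
from a previous session.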

Now let's say we get rid of such corner cases. It is still possible that when 
the client recovers it finds a znode from a previous session. It can happen 
because the lease (session) corresponding to the znode is still valid, so ZK 
can't get rid of it. Revoking leases in general is a bit complicated, but it 
sounds ok in this case if there is no risk of having multiple incarnations of 
the same element (a broker) running concurrently.

> Kafka getting stuck creating ephemeral node it has already created when two 
> zookeeper sessions are established in a very short period of time
> ---------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-1387
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1387
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8.1.1
>            Reporter: Fedor Korotkiy
>            Priority: Blocker
>              Labels: newbie, patch, zkclient-problems
>         Attachments: kafka-1387.patch
>
>
> Kafka broker re-registers itself in zookeeper every time handleNewSession() 
> callback is invoked.
> https://github.com/apache/kafka/blob/0.8.1/core/src/main/scala/kafka/server/KafkaHealthcheck.scala
>  
> Now imagine the following sequence of events.
> 1) Zookeeper session reestablishes. handleNewSession() callback is queued by 
> the zkClient, but not invoked yet.
> 2) Zookeeper session reestablishes again, queueing the callback a second time.
> 3) First callback is invoked, creating /broker/[id] ephemeral path.
> 4) Second callback is invoked and it tries to create /broker/[id] path using 
> createEphemeralPathExpectConflictHandleZKBug() function. But the path 
> already exists, so createEphemeralPathExpectConflictHandleZKBug() gets 
> stuck in an infinite loop.
> Seems like the controller election code has the same issue.
> I am able to reproduce this issue on the 0.8.1 branch from github using the 
> following configs.
> # zookeeper
> tickTime=10
> dataDir=/tmp/zk/
> clientPort=2101
> maxClientCnxns=0
> # kafka
> broker.id=1
> log.dir=/tmp/kafka
> zookeeper.connect=localhost:2101
> zookeeper.connection.timeout.ms=100
> zookeeper.sessiontimeout.ms=100
> Just start kafka and zookeeper and then pause zookeeper several times using 
> Ctrl-Z.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
