[
https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151260#comment-14151260
]
Jun Rao commented on KAFKA-1387:
--------------------------------
James,
Thanks for reporting this. Yes, what you discovered is a real problem. The fix
is going to be tricky though. The issue is the following. When a client loses an
ephemeral node in ZK due to session expiration, that ephemeral node is not
removed exactly at expiration time, but a short time after (ZOOKEEPER-1740).
When the client tries to recreate the ephemeral node and gets a
NodeExistsException, one of two things could have happened: (1) the existing node
is from the expired session and is on its way to being deleted, or (2) the node
was actually created on the latest session (the reason is what you discovered: the
client gets multiple handleNewSession() calls due to multiple session
expiration events, but the node is created on the latest session). I am not
sure if there is an easy way to distinguish the two cases though.
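One possible way to tell the two cases apart, sketched here only as an assumption and not as the actual Kafka or zkclient code: ZooKeeper records the owning session id of an ephemeral node (the ephemeralOwner field of its Stat), so a client could compare that with its own current session id. ToyZk and classify_conflict below are hypothetical names, and the store is an in-memory stand-in for the server.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Node:
    data: bytes
    ephemeral_owner: int  # id of the session that created the node


class ToyZk:
    """In-memory stand-in for one client session (hypothetical, not a real client)."""

    def __init__(self, session_id: int, store: Dict[str, Node]):
        self.session_id = session_id
        self.store = store  # shared between sessions to mimic the server

    def create_ephemeral(self, path: str, data: bytes) -> None:
        if path in self.store:
            # Stands in for the node-exists error returned by the server.
            raise FileExistsError(path)
        self.store[path] = Node(data, self.session_id)

    def owner_of(self, path: str) -> Optional[int]:
        node = self.store.get(path)
        return node.ephemeral_owner if node else None


def classify_conflict(zk: ToyZk, path: str) -> str:
    """Classify an existing node after a failed create."""
    owner = zk.owner_of(path)
    if owner is None:
        return "gone"   # deleted in the meantime; a retry should succeed
    if owner == zk.session_id:
        return "ours"   # case (2): already created on the latest session
    return "stale"      # case (1): left over from an expired session
```

Under this sketch, a caller would stop retrying on "ours" and keep waiting only on "stale". Whether this is safe against every corner case of session expiration is not established here.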
Overall, it seems to me that there are so many corner cases that one has to
deal with during ZK session expiration. The simplest approach is probably to
prevent session expiration from happening at all (e.g., set a larger session
timeout).
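Note that a larger client-side timeout only helps up to what the server allows. As a rough sketch, assuming ZooKeeper's documented defaults of minSessionTimeout = 2 * tickTime and maxSessionTimeout = 20 * tickTime, the timeout the server grants can be estimated as follows (negotiated_session_timeout is a hypothetical helper name):

```python
def negotiated_session_timeout(requested_ms, tick_time_ms, min_ms=None, max_ms=None):
    """Clamp the requested session timeout to the server's allowed range.

    Assumes the default bounds of 2 * tickTime and 20 * tickTime when the
    server does not set minSessionTimeout / maxSessionTimeout explicitly.
    """
    lower = 2 * tick_time_ms if min_ms is None else min_ms
    upper = 20 * tick_time_ms if max_ms is None else max_ms
    return max(lower, min(requested_ms, upper))


# With the tickTime=10 used in the report, even a 30 s request is capped at 200 ms.
capped = negotiated_session_timeout(30000, tick_time_ms=10)

# With a tickTime of 2000 ms, the same request is granted in full.
granted = negotiated_session_timeout(30000, tick_time_ms=2000)
```

So when raising the session timeout on the Kafka side, it is worth checking tickTime (or the explicit min/max settings) on the ZooKeeper side as well.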
> Kafka getting stuck creating ephemeral node it has already created when two
> zookeeper sessions are established in a very short period of time
> ---------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: KAFKA-1387
> URL: https://issues.apache.org/jira/browse/KAFKA-1387
> Project: Kafka
> Issue Type: Bug
> Reporter: Fedor Korotkiy
>
> The Kafka broker re-registers itself in ZooKeeper every time the
> handleNewSession() callback is invoked.
> https://github.com/apache/kafka/blob/0.8.1/core/src/main/scala/kafka/server/KafkaHealthcheck.scala
>
> Now imagine the following sequence of events.
> 1) The ZooKeeper session is reestablished. A handleNewSession() callback is
> queued by the zkClient, but not invoked yet.
> 2) The ZooKeeper session is reestablished again, queueing the callback a
> second time.
> 3) The first callback is invoked, creating the /broker/[id] ephemeral path.
> 4) The second callback is invoked and tries to create the /broker/[id] path
> using createEphemeralPathExpectConflictHandleZKBug(). But the path already
> exists, so createEphemeralPathExpectConflictHandleZKBug() gets stuck in an
> infinite loop.
> It seems the controller election code has the same issue.
> I am able to reproduce this issue on the 0.8.1 branch from GitHub using the
> following configs.
> # zookeeper
> tickTime=10
> dataDir=/tmp/zk/
> clientPort=2101
> maxClientCnxns=0
> # kafka
> broker.id=1
> log.dir=/tmp/kafka
> zookeeper.connect=localhost:2101
> zookeeper.connection.timeout.ms=100
> zookeeper.session.timeout.ms=100
> Just start kafka and zookeeper and then pause zookeeper several times using
> Ctrl-Z.
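The four-step sequence in the quoted report can be replayed against a toy model. This is a minimal sketch, not the Kafka code: replay is a hypothetical function, the expired session's ephemeral node is dropped immediately (ignoring the deletion delay of ZOOKEEPER-1740), and a bounded retry count stands in for the infinite loop.

```python
from collections import deque

PATH = "/broker/1"


def replay(num_reconnects, max_retries=5):
    """Queue one registration callback per new session, then run them in order."""
    nodes = {PATH: 1}   # path -> owning session id; session 1 registered first
    session_id = 1
    pending = deque()   # queued handleNewSession-style callbacks
    outcomes = []

    for _ in range(num_reconnects):
        # The old session expires: its ephemeral nodes disappear and a new
        # session is established, which queues (but does not run) a callback.
        nodes = {p: s for p, s in nodes.items() if s != session_id}
        session_id += 1
        pending.append("register")

    while pending:
        pending.popleft()
        for _ in range(max_retries):
            if PATH not in nodes:
                nodes[PATH] = session_id  # created on the latest session
                outcomes.append("created")
                break
            # Node exists: the retry assumes it will be deleted soon, but a
            # node owned by the live session never goes away.
        else:
            outcomes.append("stuck")
    return outcomes
```

With two reconnects before any callback runs, the first callback creates the node on the latest session and the second one can never succeed, which matches the behaviour described in the report.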