[
https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152453#comment-14152453
]
James Lent commented on KAFKA-1387:
-----------------------------------
As for your question (which I agree is one of the key questions) I have the
following comments:
* The ZooKeeper documentation states there is one case where a watch may be
missed which I do not think applies to the situation I am trying to address:
"Watches are maintained locally at the ZooKeeper server to which the client is
connected. This allows watches to be lightweight to set, maintain, and
dispatch. When a client connects to a new server, the watch will be triggered
for any session events. Watches will not be received while disconnected from a
server. When a client reconnects, any previously registered watches will be
reregistered and triggered if needed. In general this all occurs transparently.
There is one case where a watch may be missed: a watch for the existence of a
znode not yet created will be missed if the znode is created and deleted while
disconnected."
* In my testing the node is normally gone by the time the New Session event is
handled which recreates the node. In that case I do not see a Delete message (I
log that arrival of a delete event even if the node is already gone):
{noformat}
[2014-09-29 18:23:43,071] INFO zookeeper state changed (Expired)
(org.I0Itec.zkclient.ZkClient)
[2014-09-29 18:23:43,071] INFO Unable to reconnect to ZooKeeper service,
session 0x148c36a0a94000f has expired, closing socket connection
(org.apache.zookeeper.ClientCnxn)
[2014-09-29 18:23:43,071] INFO Initiating client connection,
connectString=localhost:2181/kafka/0.8 sessionTimeout=6000
watcher=org.I0Itec.zkclient.ZkClient@56404645 (org.apache.zookeeper.ZooKeeper)
[2014-09-29 18:23:43,072] INFO Opening socket connection to server
localhost/127.0.0.1:2181 (org.apache.zookeeper.ClientCnxn)
[2014-09-29 18:23:43,073] INFO Socket connection established to
localhost/127.0.0.1:2181, initiating session (org.apache.zookeeper.ClientCnxn)
[2014-09-29 18:23:43,074] INFO EventThread shut down
(org.apache.zookeeper.ClientCnxn)
[2014-09-29 18:23:43,082] INFO Closing socket connection to /10.210.10.165.
(kafka.network.Processor)
[2014-09-29 18:23:43,087] INFO Session establishment complete on server
localhost/127.0.0.1:2181, sessionid = 0x148c36a0a940010, negotiated timeout =
6000 (org.apache.zookeeper.ClientCnxn)
[2014-09-29 18:23:43,087] INFO zookeeper state changed (SyncConnected)
(org.I0Itec.zkclient.ZkClient)
[2014-09-29 18:23:43,099] INFO 0 successfully elected as leader
(kafka.server.ZookeeperLeaderElector)
[2014-09-29 18:23:43,143] INFO New session started, recreate ephemeral node
/brokers/ids/0 (kafka.utils.EphemeralNodeMonitor)
[2014-09-29 18:23:43,144] INFO Start registering broker 0 in ZooKeeper
(kafka.server.KafkaHealthcheck)
[2014-09-29 18:23:43,161] INFO Registered broker 0 at path /brokers/ids/0 with
address jlent.digitalsmiths.com:9092. (kafka.utils.ZkUtils$)
[2014-09-29 18:23:43,218] INFO Ephemeral node /brokers/ids/0 has new data
[{"jmx_port":10001,"timestamp":"1412029423148","host":"jlent.digitalsmiths.com","version":1,"port":9092}]
(kafka.utils.EphemeralNodeMonitor)
[2014-09-29 18:23:43,237] INFO New leader is 0
(kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
{noformat}
* I have seen cases where the node is still present when the New Session is
handled and in that case I do see a Delete event a short while later. I don't
have the logs that document that (don't ask me why I don't have logs to
document the most important scenario). I will try to recreate that situation.
* As an alternative I modified the New Session handling code to do nothing
(except log the arrival of the new session event). In that case I do see the
Delete event. This could perhaps be viewed a more severe test. In this case
we get notified of a Delete that actually occured before we even handled the
New Seesion event. That was actually how I did some of my original testing.
{noformat}
[2014-09-29 18:14:31,414] INFO zookeeper state changed (Expired)
(org.I0Itec.zkclient.ZkClient)
[2014-09-29 18:14:31,414] INFO Unable to reconnect to ZooKeeper service,
session 0x148c36a0a94000c has expired, closing socket connection
(org.apache.zookeeper.ClientCnxn)
[2014-09-29 18:14:31,414] INFO Initiating client connection,
connectString=localhost:2181/kafka/0.8 sessionTimeout=6000
watcher=org.I0Itec.zkclient.ZkClient@15c58840 (org.apache.zookeeper.ZooKeeper)
[2014-09-29 18:14:31,414] INFO Opening socket connection to server
localhost/127.0.0.1:2181 (org.apache.zookeeper.ClientCnxn)
[2014-09-29 18:14:31,415] INFO EventThread shut down
(org.apache.zookeeper.ClientCnxn)
[2014-09-29 18:14:31,415] INFO Socket connection established to
localhost/127.0.0.1:2181, initiating session (org.apache.zookeeper.ClientCnxn)
[2014-09-29 18:14:31,420] INFO Closing socket connection to /10.210.10.165.
(kafka.network.Processor)
[2014-09-29 18:14:31,459] INFO Session establishment complete on server
localhost/127.0.0.1:2181, sessionid = 0x148c36a0a94000d, negotiated timeout =
6000 (org.apache.zookeeper.ClientCnxn)
[2014-09-29 18:14:31,459] INFO zookeeper state changed (SyncConnected)
(org.I0Itec.zkclient.ZkClient)
[2014-09-29 18:14:31,587] INFO 0 successfully elected as leader
(kafka.server.ZookeeperLeaderElector)
[2014-09-29 18:14:31,683] INFO New session started, DO NOT recreate ephemeral
node /brokers/ids/0 (kafka.utils.EphemeralNodeMonitor)
[2014-09-29 18:14:31,701] INFO Ephemeral node /brokers/ids/0 was deleted,
recreate it (kafka.utils.EphemeralNodeMonitor)
[2014-09-29 18:14:31,702] INFO Start registering broker 0 in ZooKeeper
(kafka.server.KafkaHealthcheck)
[2014-09-29 18:14:31,722] INFO Registered broker 0 at path /brokers/ids/0 with
address jlent.digitalsmiths.com:9092. (kafka.utils.ZkUtils$)
[2014-09-29 18:14:31,744] INFO New leader is 0
(kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
[2014-09-29 18:14:31,746] INFO Ephemeral node /brokers/ids/0 has new data
[{"jmx_port":10001,"timestamp":"1412028871704","host":"jlent.digitalsmiths.com","version":1,"port":9092}]
(kafka.utils.EphemeralNodeMonitor)
{noformat}
> Kafka getting stuck creating ephemeral node it has already created when two
> zookeeper sessions are established in a very short period of time
> ---------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: KAFKA-1387
> URL: https://issues.apache.org/jira/browse/KAFKA-1387
> Project: Kafka
> Issue Type: Bug
> Reporter: Fedor Korotkiy
>
> Kafka broker re-registers itself in zookeeper every time handleNewSession()
> callback is invoked.
> https://github.com/apache/kafka/blob/0.8.1/core/src/main/scala/kafka/server/KafkaHealthcheck.scala
>
> Now imagine the following sequence of events.
> 1) Zookeeper session reestablishes. handleNewSession() callback is queued by
> the zkClient, but not invoked yet.
> 2) Zookeeper session reestablishes again, queueing callback second time.
> 3) First callback is invoked, creating /broker/[id] ephemeral path.
> 4) Second callback is invoked and it tries to create /broker/[id] path using
> createEphemeralPathExpectConflictHandleZKBug() function. But the path is
> already exists, so createEphemeralPathExpectConflictHandleZKBug() is getting
> stuck in the infinite loop.
> Seems like controller election code have the same issue.
> I'am able to reproduce this issue on the 0.8.1 branch from github using the
> following configs.
> # zookeeper
> tickTime=10
> dataDir=/tmp/zk/
> clientPort=2101
> maxClientCnxns=0
> # kafka
> broker.id=1
> log.dir=/tmp/kafka
> zookeeper.connect=localhost:2101
> zookeeper.connection.timeout.ms=100
> zookeeper.sessiontimeout.ms=100
> Just start kafka and zookeeper and then pause zookeeper several times using
> Ctrl-Z.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)