[ https://issues.apache.org/jira/browse/KAFKA-764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568727#comment-16568727 ]
Ray Chiang commented on KAFKA-764:
----------------------------------

This is for a very old version of Kafka. If I don't see any update in a week or so, I'm going to close this JIRA.

> Race Condition in Broker Registration after ZooKeeper disconnect
> ----------------------------------------------------------------
>
>                 Key: KAFKA-764
>                 URL: https://issues.apache.org/jira/browse/KAFKA-764
>             Project: Kafka
>          Issue Type: Bug
>          Components: zkclient
>    Affects Versions: 0.7.1
>            Reporter: Bob Cotton
>            Priority: Major
>         Attachments: BPPF_2900-Broker_Logs.tbz2
>
>
> When running our ZooKeepers in VMware, occasionally all the keepers
> simultaneously pause long enough for the Kafka clients to time out, and then
> the keepers simultaneously un-pause.
> When this happens, the zk clients disconnect from ZooKeeper. When ZooKeeper
> comes back, ZkUtils.createEphemeralPathExpectConflict discovers a node that
> already carries the broker's own id, does not re-register the broker id node,
> and the function call succeeds. Then ZooKeeper figures out the broker
> disconnected from the keeper and deletes the ephemeral node *after* allowing
> the consumer to read the data in the /brokers/ids/x node. The broker then goes
> on to register all the topics, etc. When consumers connect, they see topic
> nodes associated with the broker but they can't find the broker node to get
> connection information for the broker, sending them into a rebalance loop
> until they reach rebalance.retries.max and fail.
> This might also be a ZooKeeper issue, but the desired behavior for a
> disconnect case might be, if the broker node is found, to explicitly delete
> and recreate it.
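
The sequence described in the report can be illustrated with a toy model. The sketch below is not Kafka or ZooKeeper code: `FakeZk`, `register_expect_conflict`, `register_delete_and_recreate`, the session ids, and the `broker-host:9092` payload are all hypothetical names invented for illustration. It only assumes what the report states, namely that an ephemeral node is removed when the session that created it expires, and that the registration call treats an existing node with matching data as success.

```python
class NodeExistsError(Exception):
    """Toy analogue of a 'node already exists' error."""


class FakeZk:
    """In-memory stand-in for an ephemeral-node store (not real ZooKeeper)."""

    def __init__(self):
        self.nodes = {}  # path -> (data, id of the session that owns the node)

    def create_ephemeral(self, path, data, session_id):
        if path in self.nodes:
            raise NodeExistsError(path)
        self.nodes[path] = (data, session_id)

    def read(self, path):
        entry = self.nodes.get(path)
        return None if entry is None else entry[0]

    def delete(self, path):
        self.nodes.pop(path, None)

    def expire_session(self, session_id):
        # Ephemeral nodes disappear together with the session that created them.
        stale = [p for p, (_, owner) in self.nodes.items() if owner == session_id]
        for p in stale:
            del self.nodes[p]


def register_expect_conflict(zk, path, data, session_id):
    """Behaviour as described in the report: matching data counts as success."""
    try:
        zk.create_ephemeral(path, data, session_id)
    except NodeExistsError:
        if zk.read(path) != data:
            raise
        # Same data: return normally, but the node still belongs to the old session.


def register_delete_and_recreate(zk, path, data, session_id):
    """Suggested alternative: replace the stale node so the new session owns it."""
    try:
        zk.create_ephemeral(path, data, session_id)
    except NodeExistsError:
        if zk.read(path) != data:
            raise
        zk.delete(path)
        zk.create_ephemeral(path, data, session_id)


def simulate(register):
    zk = FakeZk()
    path, data = "/brokers/ids/1", "broker-host:9092"  # hypothetical values
    zk.create_ephemeral(path, data, session_id=1)      # registration before the pause
    register(zk, path, data, session_id=2)             # re-registration on a new session
    zk.expire_session(1)                               # old session expires afterwards
    return zk.read(path)


print(simulate(register_expect_conflict))      # None: broker node vanished
print(simulate(register_delete_and_recreate))  # broker-host:9092
```

In this model the first strategy leaves the broker without an id node once the old session expires, which matches the state that sends consumers into the rebalance loop. The second strategy survives the expiry because the node is owned by the new session. A real fix would also have to consider the window between the delete and the recreate, which this sketch ignores.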