[ 
https://issues.apache.org/jira/browse/KAFKA-764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Chiang updated KAFKA-764:
-----------------------------
    Component/s: zkclient

> Race Condition in Broker Registration after ZooKeeper disconnect
> ----------------------------------------------------------------
>
>                 Key: KAFKA-764
>                 URL: https://issues.apache.org/jira/browse/KAFKA-764
>             Project: Kafka
>          Issue Type: Bug
>          Components: zkclient
>    Affects Versions: 0.7.1
>            Reporter: Bob Cotton
>            Priority: Major
>         Attachments: BPPF_2900-Broker_Logs.tbz2
>
>
> When running our ZooKeeper servers in VMware, occasionally all of them
> pause simultaneously, long enough for the Kafka clients to time out, and
> then un-pause simultaneously.
> When this happens, the zk clients disconnect from ZooKeeper. When ZooKeeper
> comes back, ZkUtils.createEphemeralPathExpectConflict finds the broker's own
> id node already present, does not re-register it, and the function call
> succeeds. ZooKeeper then determines that the broker disconnected from it
> and deletes the ephemeral node *after* allowing the consumer to read the data
> in the /brokers/ids/x node. The broker then goes on to register all the
> topics, etc. When consumers connect, they see topic nodes associated with
> the broker but they can't find the broker node to get connection information
> for the broker, sending them into a rebalance loop until they reach
> rebalance.retries.max and fail.
> This might also be a ZooKeeper issue, but the desired behavior for a
> disconnect case might be to explicitly delete and recreate the broker node
> if it is found.
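
The sequence described above can be modelled with a small simulation. This is only a sketch under stated assumptions: FakeZk is a hypothetical in-memory stand-in for ZooKeeper's ephemeral nodes, not a real client, and the two register functions are illustrative names, not Kafka's actual code. The session ids and the "host:9092" payload are placeholders. A real ZooKeeper server exposes the owning session of an ephemeral node through the ephemeralOwner field of its Stat, which is what the owner check below stands in for.

```python
class FakeZk:
    """Hypothetical in-memory model of ZooKeeper ephemeral nodes."""

    def __init__(self):
        self.nodes = {}  # path -> (data, owner_session_id)

    def create_ephemeral(self, path, data, session_id):
        if path in self.nodes:
            raise KeyError(path)  # models a "node exists" error
        self.nodes[path] = (data, session_id)

    def delete(self, path):
        self.nodes.pop(path, None)

    def expire_session(self, session_id):
        # The server removes every ephemeral node owned by the expired session.
        self.nodes = {p: v for p, v in self.nodes.items() if v[1] != session_id}


def register_expect_conflict(zk, path, data, session_id):
    """Behaviour described in the report: matching data counts as success."""
    try:
        zk.create_ephemeral(path, data, session_id)
    except KeyError:
        stored_data, _owner = zk.nodes[path]
        if stored_data != data:
            raise  # a different broker holds this id


def register_delete_and_recreate(zk, path, data, session_id):
    """Suggested behaviour: replace a node left behind by another session."""
    try:
        zk.create_ephemeral(path, data, session_id)
    except KeyError:
        stored_data, owner = zk.nodes[path]
        if stored_data != data:
            raise  # a different broker holds this id
        if owner != session_id:
            # Stale node from the old session: recreate it under the new one.
            zk.delete(path)
            zk.create_ephemeral(path, data, session_id)


def simulate(register):
    """Return True if the broker node survives the late session expiry."""
    zk = FakeZk()
    path, data = "/brokers/ids/1", "host:9092"
    zk.create_ephemeral(path, data, session_id=100)  # original session
    register(zk, path, data, session_id=200)         # re-register after reconnect
    zk.expire_session(100)                           # old session expires late
    return path in zk.nodes
```

Calling simulate(register_expect_conflict) leaves no broker node once the old session expires, which matches the state the consumers run into. Calling simulate(register_delete_and_recreate) keeps the node, because it is now owned by the new session.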



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
