[ https://issues.apache.org/jira/browse/KAFKA-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ismael Juma resolved KAFKA-2193.
--------------------------------
Resolution: Duplicate
Duplicate of KAFKA-5473.
> Intermittent network + DNS issues can cause brokers to permanently drop out
> of a cluster
> ----------------------------------------------------------------------------------------
>
> Key: KAFKA-2193
> URL: https://issues.apache.org/jira/browse/KAFKA-2193
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 0.8.1.1
> Reporter: Tom Lee
> Labels: broker
>
> Our Kafka cluster recently experienced some intermittent network & DNS
> resolution issues such that this call to connect to Zookeeper failed with an
> UnknownHostException:
> https://github.com/sgroschupf/zkclient/blob/0630c9c6e67ab49a51e80bfd939e4a0d01a69dfe/src/main/java/org/I0Itec/zkclient/ZkConnection.java#L67
> We observed this happen during a processStateChanged(KeeperState.Expired)
> call:
> https://github.com/sgroschupf/zkclient/blob/0630c9c6e67ab49a51e80bfd939e4a0d01a69dfe/src/main/java/org/I0Itec/zkclient/ZkClient.java#L649
> The session expiry was in turn caused by what we suspect to be intermittent
> network issues.
> The failed ZK reconnect seemed to put ZkClient into a state from which it
> would never recover, and the Kafka broker into a state where it needed a
> restart to rejoin the cluster: ZkConnection._zk == null, and zkclient 0.3.x
> does not appear to make further reconnect attempts after the failure, so no
> further state transitions seem likely to happen without a connection to ZK.
> The newer zkclient 0.4.0/0.5.0 releases will helpfully fire a notification
> when this occurs, so the brokers have an opportunity to handle this sort of
> failure in a more graceful manner (e.g. by trying to reconnect after some
> backoff period):
> https://github.com/sgroschupf/zkclient/blob/0.4.0/src/main/java/org/I0Itec/zkclient/ZkClient.java#L461
> Happy to provide more info here if I can.
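
The "reconnect after some backoff period" idea in the description above could look roughly like the sketch below. It is an illustration only, not the zkclient or Kafka fix: the class ReconnectWithBackoff, the method connectWithBackoff, and its parameters are hypothetical names, and the connect step is passed in as a plain Callable so that the example does not depend on any zkclient API. It assumes that an UnknownHostException is transient and worth retrying, which matches the intermittent DNS failure described in the report.

```java
import java.net.UnknownHostException;
import java.util.concurrent.Callable;

// Hypothetical helper, not part of zkclient or Kafka.
final class ReconnectWithBackoff {

    /**
     * Calls connect until it succeeds or maxAttempts is reached.
     * Only UnknownHostException is treated as retriable (an assumption);
     * any other exception propagates immediately.
     * The delay doubles after each failure, capped at maxDelayMs.
     */
    static <T> T connectWithBackoff(Callable<T> connect,
                                    int maxAttempts,
                                    long initialDelayMs,
                                    long maxDelayMs) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delayMs = initialDelayMs;
        for (int attempt = 1; ; attempt++) {
            try {
                return connect.call();
            } catch (UnknownHostException e) {
                if (attempt >= maxAttempts) {
                    throw e; // give up and surface the last DNS failure
                }
                Thread.sleep(delayMs);
                delayMs = Math.min(delayMs * 2, maxDelayMs);
            }
        }
    }
}
```

A caller would wrap whatever re-creates the ZooKeeper connection in the Callable. Bounding the number of attempts keeps a permanently misconfigured host name from retrying forever, and a broker could instead choose to shut down cleanly once the attempts are exhausted.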