Igor Maravić created KAFKA-2182:
-----------------------------------
Summary: zkClient dies if there is any exception while reconnecting
Key: KAFKA-2182
URL: https://issues.apache.org/jira/browse/KAFKA-2182
Project: Kafka
Issue Type: Bug
Components: core
Affects Versions: 0.8.1
Reporter: Igor Maravić
Priority: Critical
We at Spotify have just been hit by a bug related to ZkClient. It took a
whole Kafka cluster down.
Long story short: we restarted a TOR (top-of-rack) switch, and all the brokers
in the cluster lost connectivity to the ZooKeeper cluster, which was living in
another rack. Because of that, all of the ZooKeeper sessions expired.
Everything would have been fine had we not, for some reason, lost the
hostname-to-IP-address mapping. Since we did lose that mapping, we got an
UnknownHostException when the broker tried to reconnect. This exception got
swallowed, and we never reconnected to ZooKeeper, which in turn made our
cluster of brokers useless.
If ZkClient had a "handleSessionEstablishmentError" callback, the exception
could be caught and we could simply kill the KafkaServer process and start it
cleanly, since this kind of exception is fatal for the Kafka cluster.
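To make the proposal concrete, here is a minimal sketch of the idea. It does
not use the real ZkClient classes: the SessionErrorListener and SessionFactory
interfaces, the ReconnectSketch class and the example host name are all
hypothetical stand-ins. The only point is that a failure while re-establishing
the session is handed to the application instead of being swallowed.

```java
import java.net.UnknownHostException;

/** Hypothetical callback, modelled on the "handleSessionEstablishmentError" idea above. */
interface SessionErrorListener {
    void handleSessionEstablishmentError(Throwable error);
}

/** Hypothetical stand-in for whatever opens a new ZooKeeper session. */
interface SessionFactory {
    void connect(String connectString) throws Exception;
}

final class ReconnectSketch {

    /** Tries to re-establish the session; returns true on success, false on failure. */
    static boolean reconnect(String connectString,
                             SessionFactory factory,
                             SessionErrorListener listener) {
        try {
            factory.connect(connectString);
            return true;
        } catch (Exception e) {
            // Hand the failure to the application rather than only logging it.
            listener.handleSessionEstablishmentError(e);
            return false;
        }
    }

    public static void main(String[] args) {
        // Simulates the DNS failure from this report (assumed host name).
        SessionFactory failingDns = cs -> { throw new UnknownHostException(cs); };

        SessionErrorListener fatal = error -> {
            System.err.println("Fatal: could not re-establish ZooKeeper session: " + error);
            // A broker could stop itself here so that a supervisor restarts it
            // cleanly, for example with Runtime.getRuntime().halt(1).
        };

        boolean ok = reconnect("zookeeper1.example.invalid:2181", failingDns, fatal);
        System.out.println("reconnected=" + ok);
    }
}
```

With a callback of this shape, the broker itself decides how to react. In our
case it would shut down, so that the process can be restarted once name
resolution works again.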
{code}
2015-05-05T12:49:01.709+00:00 127.0.0.1 apache-kafka[main-EventThread] INFO
zookeeper.ZooKeeper - Initiating client connection,
connectString=zookeeper1.spotify.net:2181,zookeeper2.spotify.net:2181,zookeeper3.spotify.net:2181/gabobroker-local
sessionTimeout=6000 watcher=org.I0Itec.zkclient.ZkClient@7303d690
2015-05-05T12:49:01.711+00:00 127.0.0.1 apache-kafka[main-EventThread] ERROR
zookeeper.ClientCnxn - Error while calling watcher
2015-05-05T12:49:01.711+00:00 127.0.0.1 java.lang.RuntimeException: Exception
while restarting zk client
2015-05-05T12:49:01.711+00:00 127.0.0.1 at
org.I0Itec.zkclient.ZkClient.processStateChanged(ZkClient.java:462)
2015-05-05T12:49:01.711+00:00 127.0.0.1 at
org.I0Itec.zkclient.ZkClient.process(ZkClient.java:368)
2015-05-05T12:49:01.711+00:00 127.0.0.1 at
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
2015-05-05T12:49:01.711+00:00 127.0.0.1 at
org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
2015-05-05T12:49:01.711+00:00 127.0.0.1 Caused by:
org.I0Itec.zkclient.exception.ZkException: Unable to connect to
zookeeper1.spotify.net:2181,zookeeper2.spotify.net:2181,zookeeper3.spotify.net:2181/gabobroker-local
2015-05-05T12:49:01.711+00:00 127.0.0.1 at
org.I0Itec.zkclient.ZkConnection.connect(ZkConnection.java:66)
2015-05-05T12:49:01.711+00:00 127.0.0.1 at
org.I0Itec.zkclient.ZkClient.reconnect(ZkClient.java:939)
2015-05-05T12:49:01.711+00:00 127.0.0.1 at
org.I0Itec.zkclient.ZkClient.processStateChanged(ZkClient.java:458)
2015-05-05T12:49:01.711+00:00 127.0.0.1 ... 3 more
2015-05-05T12:49:01.712+00:00 127.0.0.1 Caused by:
java.net.UnknownHostException: zookeeper1.spotify.net: Name or service not known
2015-05-05T12:49:01.712+00:00 127.0.0.1 at
java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
2015-05-05T12:49:01.712+00:00 127.0.0.1 at
java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:901)
2015-05-05T12:49:01.712+00:00 127.0.0.1 at
java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1293)
2015-05-05T12:49:01.712+00:00 127.0.0.1 at
java.net.InetAddress.getAllByName0(InetAddress.java:1246)
2015-05-05T12:49:01.712+00:00 127.0.0.1 at
java.net.InetAddress.getAllByName(InetAddress.java:1162)
2015-05-05T12:49:01.712+00:00 127.0.0.1 at
java.net.InetAddress.getAllByName(InetAddress.java:1098)
2015-05-05T12:49:01.712+00:00 127.0.0.1 at
org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:61)
2015-05-05T12:49:01.712+00:00 127.0.0.1 at
org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445)
2015-05-05T12:49:01.712+00:00 127.0.0.1 at
org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:380)
2015-05-05T12:49:01.713+00:00 127.0.0.1 at
org.I0Itec.zkclient.ZkConnection.connect(ZkConnection.java:64)
2015-05-05T12:49:01.713+00:00 127.0.0.1 ... 5 more
2015-05-05T12:49:01.713+00:00 127.0.0.1
apache-kafka[ZkClient-EventThread-18-zookeeper1.spotify.net:2181,zookeeper2.spotify.net:2181,zookeeper3.spotify.net:2181/gabobroker-local]
ERROR zkclient.ZkEventThread - Error handling event ZkEvent[Children of
/config/changes changed sent to
kafka.server.TopicConfigManager$ConfigChangeListener$@17638f6]
2015-05-05T12:49:01.713+00:00 127.0.0.1 java.lang.NullPointerException
2015-05-05T12:49:01.713+00:00 127.0.0.1 at
org.I0Itec.zkclient.ZkConnection.exists(ZkConnection.java:95)
2015-05-05T12:49:01.713+00:00 127.0.0.1 at
org.I0Itec.zkclient.ZkClient$3.call(ZkClient.java:439)
2015-05-05T12:49:01.713+00:00 127.0.0.1 at
org.I0Itec.zkclient.ZkClient$3.call(ZkClient.java:436)
2015-05-05T12:49:01.713+00:00 127.0.0.1 at
org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:675)
2015-05-05T12:49:01.713+00:00 127.0.0.1 at
org.I0Itec.zkclient.ZkClient.exists(ZkClient.java:436)
2015-05-05T12:49:01.713+00:00 127.0.0.1 at
org.I0Itec.zkclient.ZkClient.exists(ZkClient.java:445)
2015-05-05T12:49:01.714+00:00 127.0.0.1 at
org.I0Itec.zkclient.ZkClient$7.run(ZkClient.java:566)
2015-05-05T12:49:01.714+00:00 127.0.0.1 at
org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)
2015-05-05T12:49:01.714+00:00 127.0.0.1 apache-kafka[main-EventThread] INFO
zookeeper.ClientCnxn - EventThread shut down
2015-05-05T12:49:01.714+00:00 127.0.0.1
apache-kafka[ZkClient-EventThread-18-zookeeper1.spotify.net:2181,zookeeper2.spotify.net:2181,zookeeper3.spotify.net:2181/gabobroker-local]
ERROR zkclient.ZkEventThread - Error handling event ZkEvent[Data of
/controller changed sent to
kafka.server.ZookeeperLeaderElector$LeaderChangeListener@18360394]
2015-05-05T12:49:01.714+00:00 127.0.0.1 java.lang.NullPointerException
2015-05-05T12:49:01.714+00:00 127.0.0.1 at
org.I0Itec.zkclient.ZkConnection.exists(ZkConnection.java:95)
2015-05-05T12:49:01.714+00:00 127.0.0.1 at
org.I0Itec.zkclient.ZkClient$3.call(ZkClient.java:439)
2015-05-05T12:49:01.714+00:00 127.0.0.1 at
org.I0Itec.zkclient.ZkClient$3.call(ZkClient.java:436)
2015-05-05T12:49:01.714+00:00 127.0.0.1 at
org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:675)
2015-05-05T12:49:01.714+00:00 127.0.0.1 at
org.I0Itec.zkclient.ZkClient.exists(ZkClient.java:436)
2015-05-05T12:49:01.714+00:00 127.0.0.1 at
org.I0Itec.zkclient.ZkClient$6.run(ZkClient.java:544)
2015-05-05T12:49:01.714+00:00 127.0.0.1 at
org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)
{code}