[ https://issues.apache.org/jira/browse/KAFKA-2864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mahdi updated KAFKA-2864:
-------------------------
    Attachment: kafka.log

> Bad zookeeper host causes broker to shutdown uncleanly and stall producers
> --------------------------------------------------------------------------
>
>                 Key: KAFKA-2864
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2864
>             Project: Kafka
>          Issue Type: Bug
>          Components: zkclient
>    Affects Versions: 0.8.2.1
>            Reporter: Mahdi
>            Priority: Critical
>         Attachments: kafka.log
>
>
> We are using kafka 0.8.2.1 and we noticed that kafka and the zookeeper client
> were unable to handle a non-existent zookeeper instance gracefully. This
> caused one of our brokers to get stuck during a self-inflicted shutdown,
> which appeared to impact the partitions for which the broker was the leader
> even though we had two other replicas.
> Here is a timeline of what happened (shortened for brevity; I'll attach log
> snippets):
> We have a 7-node zookeeper cluster. Two of our nodes (zookeeper15 and
> zookeeper16) were decommissioned and their DNS records removed. The
> decommissioning happened about two weeks earlier. We noticed the following
> in the logs:
> - Opening socket connection to server ip-10-0-0-1.ec2.internal/10.0.0.1:2181.
>   Will not attempt to authenticate using SASL (unknown error)
> - Client session timed out, have not heard from server in 858ms for
>   sessionid 0x1250c5c0f1f5001c, closing socket connection and attempting
>   reconnect
> - Opening socket connection to server ip-10.0.0.2.ec2.internal/10.0.0.2:2181.
>   Will not attempt to authenticate using SASL (unknown error)
> - zookeeper state changed (Disconnected)
> - Client session timed out, have not heard from server in 2677ms for
>   sessionid 0x1250c5c0f1f5001c, closing socket connection and attempting
>   reconnect
> - Opening socket connection to server ip-10.0.0.3.ec2.internal/10.0.0.3:2181.
>   Will not attempt to authenticate using SASL (unknown error)
> - Socket connection established to ip-10.0.0.3.ec2.internal/10.0.0.3:2181,
>   initiating session
> - zookeeper state changed (Expired)
> - Initiating client connection,
>   connectString=zookeeper21.example.com:2181,zookeeper19.example.com:2181,zookeeper22.example.com:2181,zookeeper18.example.com:2181,zookeeper20.example.com:2181,zookeeper16.example.com:2181,zookeeper15.example.com:2181/foo/kafka/central
>   sessionTimeout=6000 watcher=org.I0Itec.zkclient.ZkClient@3bbc39f8
> - Unable to reconnect to ZooKeeper service, session 0x1250c5c0f1f5001c has
>   expired, closing socket connection
> - Unable to re-establish connection. Notifying consumer of the following
>   exception:
> org.I0Itec.zkclient.exception.ZkException: Unable to connect to zookeeper21.example.com:2181,zookeeper19.example.com:2181,zookeeper22.example.com:2181,zookeeper18.example.com:2181,zookeeper20.example.com:2181,zookeeper16.example.com:2181,zookeeper15.example.com:2181/foo/kafka/central
>     at org.I0Itec.zkclient.ZkConnection.connect(ZkConnection.java:69)
>     at org.I0Itec.zkclient.ZkClient.reconnect(ZkClient.java:1176)
>     at org.I0Itec.zkclient.ZkClient.processStateChanged(ZkClient.java:649)
>     at org.I0Itec.zkclient.ZkClient.process(ZkClient.java:560)
>     at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
>     at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> Caused by: java.net.UnknownHostException: zookeeper16.example.com: unknown error
>     at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
>     at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
>     at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
>     at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
>     at java.net.InetAddress.getAllByName(InetAddress.java:1192)
>     at java.net.InetAddress.getAllByName(InetAddress.java:1126)
>     at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:61)
>     at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445)
>     at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:380)
>     at org.I0Itec.zkclient.ZkConnection.connect(ZkConnection.java:67)
>     ... 5 more
> That seems to have caused the following:
> [main-EventThread] [org.apache.zookeeper.ClientCnxn]: EventThread shut down
> Which in turn caused kafka to shut itself down:
> [Thread-2] [kafka.server.KafkaServer]: [Kafka Server 13], shutting down
> [Thread-2] [kafka.server.KafkaServer]: [Kafka Server 13], Starting controlled shutdown
> However, the shutdown didn't go as expected, apparently due to an NPE in the
> zk client:
> 2015-11-12T12:03:40.101Z WARN [Thread-2] [kafka.utils.Utils$]:
> java.lang.NullPointerException
>     at org.I0Itec.zkclient.ZkConnection.readData(ZkConnection.java:117)
>     at org.I0Itec.zkclient.ZkClient$10.call(ZkClient.java:992)
>     at org.I0Itec.zkclient.ZkClient$10.call(ZkClient.java:988)
>     at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:883)
>     at org.I0Itec.zkclient.ZkClient.readData(ZkClient.java:988)
>     at org.I0Itec.zkclient.ZkClient.readData(ZkClient.java:983)
>     at kafka.utils.ZkUtils$.readDataMaybeNull(ZkUtils.scala:450)
>     at kafka.utils.ZkUtils$.getController(ZkUtils.scala:65)
>     at kafka.server.KafkaServer.kafka$server$KafkaServer$$controlledShutdown(KafkaServer.scala:194)
>     at kafka.server.KafkaServer$$anonfun$shutdown$1.apply$mcV$sp(KafkaServer.scala:269)
>     at kafka.utils.Utils$.swallow(Utils.scala:172)
>     at kafka.utils.Logging$class.swallowWarn(Logging.scala:92)
>     at kafka.utils.Utils$.swallowWarn(Utils.scala:45)
>     at kafka.utils.Logging$class.swallow(Logging.scala:94)
>     at kafka.utils.Utils$.swallow(Utils.scala:45)
>     at kafka.server.KafkaServer.shutdown(KafkaServer.scala:269)
>     at kafka.server.KafkaServerStartable.shutdown(KafkaServerStartable.scala:42)
>     at kafka.Kafka$$anon$1.run(Kafka.scala:42)
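> As the UnknownHostException trace shows, the ZooKeeper client's
> StaticHostProvider constructor resolves every host in the connect string
> eagerly, so a single stale DNS entry aborts the whole reconnect. As an
> illustration only (not part of Kafka; the class and method names below are
> hypothetical), a broker operator could pre-filter unresolvable hosts out of
> the connect string before handing it to the client:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;

// Hedged workaround sketch: drop connect-string entries whose hostnames no
// longer resolve, so decommissioned nodes (like zookeeper15/zookeeper16 here)
// can't make StaticHostProvider's eager resolution throw.
public final class ConnectStringFilter {

    static String dropUnresolvableHosts(String connectString) {
        // A connect string may carry a chroot suffix, e.g. "/foo/kafka/central".
        String hostPart = connectString;
        String chroot = "";
        int slash = connectString.indexOf('/');
        if (slash >= 0) {
            hostPart = connectString.substring(0, slash);
            chroot = connectString.substring(slash);
        }
        List<String> reachable = new ArrayList<>();
        for (String hostPort : hostPart.split(",")) {
            String host = hostPort.split(":")[0];
            try {
                InetAddress.getByName(host); // DNS lookup only, no connection
                reachable.add(hostPort);
            } catch (UnknownHostException e) {
                // Skip hosts whose DNS records were removed.
            }
        }
        return String.join(",", reachable) + chroot;
    }

    public static void main(String[] args) {
        // ".invalid" is reserved by RFC 2606 and never resolves.
        System.out.println(dropUnresolvableHosts(
                "localhost:2181,zk-gone.invalid:2181/foo/kafka/central"));
    }
}
```

> A filter like this only masks the problem, of course: the underlying issues
> reported here are the eager resolution at connect time and the zkclient NPE
> during controlled shutdown.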
> 2015-11-12T12:03:40.106Z INFO [Thread-2] [kafka.network.SocketServer]: [Socket Server on Broker 13], Shutting down
> The kafka process continued running after this point, which is confirmed by
> the continuous rolling of logs:
> [ReplicaFetcherThread-3-9] [kafka.log.Log]: Rolled new log segment for 'topic-a-1' in 0 ms.
> [ReplicaFetcherThread-0-12] [kafka.log.Log]: Rolled new log segment for 'topic-b-4' in 0 ms.
> etc.
> At this point, that broker was in a half-dead state. Our clients were still
> timing out enqueuing messages to it. The under-replicated partition count on
> the other brokers was stuck at a positive, constant value and did not make
> any progress. We also noticed that the JMX connector threads weren't
> responding, which is how we found out that the process was in bad shape.
> This went on for about 40 minutes until we killed the process and restarted
> it. Things recovered after the restart.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)