Hello,
See below for my original email. I was wondering if anybody has feedback
on the 4 questions I've asked. Should I go ahead and file this as a bug?
Thanks.
--
Mahdi.
On 11/12/15 2:37 PM, Mahdi Ben Hamida wrote:
Hi Everyone,
We are using Kafka 0.8.2.1, and we noticed that Kafka and the ZooKeeper
client were not able to gracefully handle a non-existent ZooKeeper
instance. This caused one of our brokers to get stuck during a shutdown,
which seemed to impact the partitions for which that broker was the
leader, even though we had two other replicas.
Here is a timeline of what happened (shortened for brevity; longer
logs here [*]):
We have a 7-node ZooKeeper cluster. Two of the nodes (zookeeper15 and
zookeeper16) were decommissioned and their DNS records removed, about
two weeks earlier. We noticed the following in the logs:
- Opening socket connection to server ip-10-0-0-1.ec2.internal/10.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
- Client session timed out, have not heard from server in 858ms for sessionid 0x1250c5c0f1f5001c, closing socket connection and attempting reconnect
- Opening socket connection to server ip-10-0-0-2.ec2.internal/10.0.0.2:2181. Will not attempt to authenticate using SASL (unknown error)
- zookeeper state changed (Disconnected)
- Client session timed out, have not heard from server in 2677ms for sessionid 0x1250c5c0f1f5001c, closing socket connection and attempting reconnect
- Opening socket connection to server ip-10-0-0-3.ec2.internal/10.0.0.3:2181. Will not attempt to authenticate using SASL (unknown error)
- Socket connection established to ip-10-0-0-3.ec2.internal/10.0.0.3:2181, initiating session
- zookeeper state changed (Expired)
- Initiating client connection, connectString=zookeeper21.example.com:2181,zookeeper19.example.com:2181,zookeeper22.example.com:2181,zookeeper18.example.com:2181,zookeeper20.example.com:2181,zookeeper16.example.com:2181,zookeeper15.example.com:2181/foo/kafka/central sessionTimeout=6000 watcher=org.I0Itec.zkclient.ZkClient@3bbc39f8
- Unable to reconnect to ZooKeeper service, session 0x1250c5c0f1f5001c has expired, closing socket connection
- Unable to re-establish connection. Notifying consumer of the following exception:
org.I0Itec.zkclient.exception.ZkException: Unable to connect to zookeeper21.example.com:2181,zookeeper19.example.com:2181,zookeeper22.example.com:2181,zookeeper18.example.com:2181,zookeeper20.example.com:2181,zookeeper16.example.com:2181,zookeeper15.example.com:2181/foo/kafka/central
    at org.I0Itec.zkclient.ZkConnection.connect(ZkConnection.java:69)
    at org.I0Itec.zkclient.ZkClient.reconnect(ZkClient.java:1176)
    at org.I0Itec.zkclient.ZkClient.processStateChanged(ZkClient.java:649)
    at org.I0Itec.zkclient.ZkClient.process(ZkClient.java:560)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
*Caused by: java.net.UnknownHostException: zookeeper16.example.com: unknown error*
    at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
    at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
    at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
    at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
    at java.net.InetAddress.getAllByName(InetAddress.java:1192)
    at java.net.InetAddress.getAllByName(InetAddress.java:1126)
    at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:61)
    at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445)
    at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:380)
    at org.I0Itec.zkclient.ZkConnection.connect(ZkConnection.java:67)
    ... 5 more
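As far as I can tell from the trace, the root cause is that
StaticHostProvider resolves every host in the connect string eagerly via
InetAddress.getAllByName, so a single stale DNS entry aborts the entire
reconnect even though five healthy servers were listed. The failure mode
can be reproduced with the plain JDK (no Kafka or ZooKeeper needed; the
hostnames are the ones from our connect string):

import java.net.InetAddress;
import java.net.UnknownHostException;

public class ResolveCheck {
    public static void main(String[] args) {
        // one live record and one whose DNS entry was removed
        String[] hosts = { "zookeeper21.example.com", "zookeeper16.example.com" };
        for (String host : hosts) {
            try {
                for (InetAddress addr : InetAddress.getAllByName(host)) {
                    System.out.println(host + " -> " + addr.getHostAddress());
                }
            } catch (UnknownHostException e) {
                // StaticHostProvider propagates this exception from its
                // constructor, which is what kills the reconnect attempt above
                System.err.println("unresolvable: " + host);
            }
        }
    }
}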
That seems to have caused the following:
[main-EventThread] [org.apache.zookeeper.ClientCnxn]: EventThread shut down
This in turn caused Kafka to shut itself down:
[Thread-2] [kafka.server.KafkaServer]: [Kafka Server 13], shutting down
[Thread-2] [kafka.server.KafkaServer]: [Kafka Server 13], Starting controlled shutdown
However, the shutdown didn't go as expected, apparently due to an NPE
in the ZooKeeper client:
2015-11-12T12:03:40.101Z WARN [Thread-2] [kafka.utils.Utils$]:
*java.lang.NullPointerException*
    at org.I0Itec.zkclient.ZkConnection.readData(ZkConnection.java:117)
    at org.I0Itec.zkclient.ZkClient$10.call(ZkClient.java:992)
    at org.I0Itec.zkclient.ZkClient$10.call(ZkClient.java:988)
    at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:883)
    at org.I0Itec.zkclient.ZkClient.readData(ZkClient.java:988)
    at org.I0Itec.zkclient.ZkClient.readData(ZkClient.java:983)
    at kafka.utils.ZkUtils$.readDataMaybeNull(ZkUtils.scala:450)
    at kafka.utils.ZkUtils$.getController(ZkUtils.scala:65)
    at kafka.server.KafkaServer.kafka$server$KafkaServer$$controlledShutdown(KafkaServer.scala:194)
    at kafka.server.KafkaServer$$anonfun$shutdown$1.apply$mcV$sp(KafkaServer.scala:269)
    at kafka.utils.Utils$.swallow(Utils.scala:172)
    at kafka.utils.Logging$class.swallowWarn(Logging.scala:92)
    at kafka.utils.Utils$.swallowWarn(Utils.scala:45)
    at kafka.utils.Logging$class.swallow(Logging.scala:94)
    at kafka.utils.Utils$.swallow(Utils.scala:45)
    at kafka.server.KafkaServer.shutdown(KafkaServer.scala:269)
    at kafka.server.KafkaServerStartable.shutdown(KafkaServerStartable.scala:42)
    at kafka.Kafka$$anon$1.run(Kafka.scala:42)
2015-11-12T12:03:40.106Z INFO [Thread-2] [kafka.network.SocketServer]: [Socket Server on Broker 13], Shutting down
The Kafka process continued running after this point, as confirmed by
the continuous rolling of log segments:
[ReplicaFetcherThread-3-9] [kafka.log.Log]: Rolled new log segment for 'topic-a-1' in 0 ms.
[ReplicaFetcherThread-0-12] [kafka.log.Log]: Rolled new log segment for 'topic-b-4' in 0 ms.
etc.
At this point, the broker was in a half-dead state. Our clients were
still timing out while enqueuing messages to it. The under-replicated
partition count on the other brokers was stuck at a positive, constant
value and made no progress. We also noticed that the JMX connector
threads weren't responding, which is how we found out that the process
was in bad shape. This went on for about 40 minutes until we killed the
process and restarted it. Things recovered after the restart.
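For what it's worth, the NPE above appears to come from
ZkConnection.readData (line 117 in the trace) dereferencing a ZooKeeper
handle that close() had already set to null. A guard along these lines,
just a sketch against our reading of the zkclient 0.x source and not a
tested patch, would at least have surfaced a descriptive error instead:

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Hypothetical guard inside org.I0Itec.zkclient.ZkConnection (untested);
// _zk is the field that close() sets to null, racing with the shutdown.
public byte[] readData(String path, Stat stat, boolean watch)
        throws KeeperException, InterruptedException {
    final ZooKeeper zk = _zk;
    if (zk == null) {
        // fail loudly with a clear message rather than an NPE
        throw new IllegalStateException("ZooKeeper connection already closed");
    }
    return zk.getData(path, watch, stat);
}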
1. Is this a known Kafka/ZooKeeper issue affecting the version we are
running? If not, I'd be happy to file a bug.
2. Since we had other healthy ZooKeeper instances (5 out of 7), is
there a reason Kafka/ZkClient couldn't have handled this more
gracefully? My assumption was that the ZooKeeper client would pick
other instances from the list until a healthy node was found. It is odd
that it gave up this quickly.
3. On the Kafka side, the server decided to shut down by itself;
however, the shutdown should have been a clean one. The NPE should
either have been avoided or caught and handled differently. It would
be good to have some clarification on when this self-shutdown is
invoked and whether we could have done things differently on our side
to avoid it (for example, restarting the broker after decommissioning
some of our ZooKeeper nodes).
4. More generally, what do other Kafka users do when recycling their
ZooKeeper clusters or replacing old machines with new ones? A sketch of
the workaround we are considering is below.
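Regarding 2 and 4, the workaround we are considering is to filter the
connect string down to hosts that still resolve before handing it to
the broker. This is our own helper, not an existing Kafka or zkclient
API, so treat it as a sketch:

import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;

public class ConnectStringFilter {
    // Drop connect-string entries whose DNS records no longer resolve,
    // so a stale host can't abort StaticHostProvider construction.
    // Preserves a chroot suffix such as /foo/kafka/central.
    public static String resolvableOnly(String connectString) {
        int slash = connectString.indexOf('/');
        String hostList = slash < 0 ? connectString : connectString.substring(0, slash);
        String chroot = slash < 0 ? "" : connectString.substring(slash);
        List<String> kept = new ArrayList<>();
        for (String hostPort : hostList.split(",")) {
            String host = hostPort.split(":")[0];
            try {
                InetAddress.getAllByName(host); // throws if the record is gone
                kept.add(hostPort);
            } catch (UnknownHostException e) {
                System.err.println("skipping unresolvable zookeeper host: " + hostPort);
            }
        }
        return String.join(",", kept) + chroot;
    }
}

We would apply this to the zookeeper.connect value at broker startup; it
obviously doesn't help if DNS changes after the client is constructed,
which is why a fix in zkclient/ZooKeeper still seems worthwhile.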
Thanks.
--
Mahdi.
[*]https://gist.github.com/mahdibh/76b230e25a3f7113349e