I've logged this as https://issues.apache.org/jira/browse/KAFKA-2864

--
Mahdi.

On 11/16/15 12:59 PM, Rajiv Kurian wrote:
Yes I think so. We specifically upgraded the Kafka broker with a patch to
avoid the ZK client NPEs. Guess not all of them are fixed. The Kafka broker
becoming a zombie even if one ZK node is bad is especially terrible.

On Tuesday, November 17, 2015, Mahdi Ben Hamida <ma...@signalfx.com> wrote:

Hello,

See below for my original email. I was wondering if anybody has feedback
on the 4 questions I've asked. Should I go ahead and file this as a bug?

Thanks.

--
Mahdi.

On 11/12/15 2:37 PM, Mahdi Ben Hamida wrote:

Hi Everyone,

We are using kafka 0.8.2.1 and we noticed that kafka/zookeeper-client
could not gracefully handle a nonexistent zookeeper instance. This
caused one of our brokers to get stuck during a shutdown, which seemed to
impact the partitions for which the broker was the leader, even though we had
two other replicas.

Here is a timeline of what happened (shortened for brevity, longer logs
here [*]):

We have a 7 node zookeeper cluster. Two of its nodes (zookeeper15 and
zookeeper16) were decommissioned and their DNS records removed, about two
weeks before the incident. We noticed the following in the logs:

- Opening socket connection to server ip-10-0-0-1.ec2.internal/10.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
- Client session timed out, have not heard from server in 858ms for sessionid 0x1250c5c0f1f5001c, closing socket connection and attempting reconnect
- Opening socket connection to server ip-10-0-0-2.ec2.internal/10.0.0.2:2181. Will not attempt to authenticate using SASL (unknown error)
- zookeeper state changed (Disconnected)
- Client session timed out, have not heard from server in 2677ms for sessionid 0x1250c5c0f1f5001c, closing socket connection and attempting reconnect
- Opening socket connection to server ip-10-0-0-3.ec2.internal/10.0.0.3:2181. Will not attempt to authenticate using SASL (unknown error)
- Socket connection established to ip-10-0-0-3.ec2.internal/10.0.0.3:2181, initiating session
- zookeeper state changed (Expired)
- Initiating client connection, connectString=zookeeper21.example.com:2181,zookeeper19.example.com:2181,zookeeper22.example.com:2181,zookeeper18.example.com:2181,zookeeper20.example.com:2181,zookeeper16.example.com:2181,zookeeper15.example.com:2181/foo/kafka/central sessionTimeout=6000 watcher=org.I0Itec.zkclient.ZkClient@3bbc39f8
- Unable to reconnect to ZooKeeper service, session 0x1250c5c0f1f5001c has expired, closing socket connection
- Unable to re-establish connection. Notifying consumer of the following exception:
org.I0Itec.zkclient.exception.ZkException: Unable to connect to zookeeper21.example.com:2181,zookeeper19.example.com:2181,zookeeper22.example.com:2181,zookeeper18.example.com:2181,zookeeper20.example.com:2181,zookeeper16.example.com:2181,zookeeper15.example.com:2181/foo/kafka/central
        at org.I0Itec.zkclient.ZkConnection.connect(ZkConnection.java:69)
        at org.I0Itec.zkclient.ZkClient.reconnect(ZkClient.java:1176)
        at org.I0Itec.zkclient.ZkClient.processStateChanged(ZkClient.java:649)
        at org.I0Itec.zkclient.ZkClient.process(ZkClient.java:560)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
Caused by: java.net.UnknownHostException: zookeeper16.example.com: unknown error
        at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
        at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
        at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
        at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
        at java.net.InetAddress.getAllByName(InetAddress.java:1192)
        at java.net.InetAddress.getAllByName(InetAddress.java:1126)
        at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:61)
        at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445)
        at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:380)
        at org.I0Itec.zkclient.ZkConnection.connect(ZkConnection.java:67)
        ... 5 more
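
For what it's worth, the bottom of that trace suggests why one dead hostname defeats the whole list: StaticHostProvider resolves every host in the connect string eagerly, in its constructor, before any connection attempt is made. The snippet below mimics that behavior (a paraphrase of the ZooKeeper 3.4.x logic for illustration, not the exact source; hostnames are placeholders):

    import java.net.InetAddress;
    import java.net.InetSocketAddress;
    import java.net.UnknownHostException;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    // Mimics the eager DNS resolution StaticHostProvider performs in its
    // constructor (ZooKeeper 3.4.x); an approximation, not the real source.
    public class EagerResolutionDemo {
        public static void main(String[] args) {
            List<InetSocketAddress> servers = Arrays.asList(
                new InetSocketAddress("zookeeper21.example.com", 2181),  // healthy
                new InetSocketAddress("zookeeper16.example.com", 2181)); // DNS removed
            List<InetSocketAddress> resolved = new ArrayList<InetSocketAddress>();
            try {
                for (InetSocketAddress address : servers) {
                    // Throws UnknownHostException for a host whose DNS record
                    // is gone, aborting the loop. In StaticHostProvider this
                    // aborts the constructor, so the new ZooKeeper handle is
                    // never built and ZkClient.reconnect fails outright.
                    for (InetAddress ia : InetAddress.getAllByName(address.getHostName())) {
                        resolved.add(new InetSocketAddress(ia.getHostAddress(), address.getPort()));
                    }
                }
                System.out.println("resolved " + resolved.size() + " addresses");
            } catch (UnknownHostException e) {
                System.err.println("one unresolvable host fails the entire list: " + e);
            }
        }
    }

If that reading is right, the five healthy servers never mattered: the reconnect dies during address resolution, before host selection even starts.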


That seems to have caused the following:
  [main-EventThread] [org.apache.zookeeper.ClientCnxn]: EventThread shut down

Which in turn caused kafka to shut itself down:
[Thread-2] [kafka.server.KafkaServer]: [Kafka Server 13], shutting down
[Thread-2] [kafka.server.KafkaServer]: [Kafka Server 13], Starting controlled shutdown

However, the shutdown didn't go as expected, apparently due to an NPE in
the zk client:

2015-11-12T12:03:40.101Z WARN  [Thread-2] [kafka.utils.Utils$]: java.lang.NullPointerException
        at org.I0Itec.zkclient.ZkConnection.readData(ZkConnection.java:117)
        at org.I0Itec.zkclient.ZkClient$10.call(ZkClient.java:992)
        at org.I0Itec.zkclient.ZkClient$10.call(ZkClient.java:988)
        at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:883)
        at org.I0Itec.zkclient.ZkClient.readData(ZkClient.java:988)
        at org.I0Itec.zkclient.ZkClient.readData(ZkClient.java:983)
        at kafka.utils.ZkUtils$.readDataMaybeNull(ZkUtils.scala:450)
        at kafka.utils.ZkUtils$.getController(ZkUtils.scala:65)
        at kafka.server.KafkaServer.kafka$server$KafkaServer$$controlledShutdown(KafkaServer.scala:194)
        at kafka.server.KafkaServer$$anonfun$shutdown$1.apply$mcV$sp(KafkaServer.scala:269)
        at kafka.utils.Utils$.swallow(Utils.scala:172)
        at kafka.utils.Logging$class.swallowWarn(Logging.scala:92)
        at kafka.utils.Utils$.swallowWarn(Utils.scala:45)
        at kafka.utils.Logging$class.swallow(Logging.scala:94)
        at kafka.utils.Utils$.swallow(Utils.scala:45)
        at kafka.server.KafkaServer.shutdown(KafkaServer.scala:269)
        at kafka.server.KafkaServerStartable.shutdown(KafkaServerStartable.scala:42)
        at kafka.Kafka$$anon$1.run(Kafka.scala:42)
2015-11-12T12:03:40.106Z INFO  [Thread-2] [kafka.network.SocketServer]: [Socket Server on Broker 13], Shutting down
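
The NPE itself looks explainable from the zkclient source: ZkConnection.readData dereferences the internal ZooKeeper handle with no null guard, and that handle is set to null when the connection is closed. So when the event thread tore down the expired session while the controlled shutdown was reading the controller path, readData hit a null handle instead of a clean "connection closed" error. A paraphrase of the two methods involved (zkclient 0.x, an approximation rather than the exact source):

    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    // Paraphrase of org.I0Itec.zkclient.ZkConnection (zkclient 0.x);
    // simplified for illustration, not the exact source.
    class ZkConnectionParaphrase {
        private ZooKeeper _zk;

        public byte[] readData(String path, Stat stat, boolean watch)
                throws KeeperException, InterruptedException {
            // No null guard: once close() has run, this line throws the
            // NullPointerException seen in the shutdown trace above.
            return _zk.getData(path, watch, stat);
        }

        public void close() throws InterruptedException {
            if (_zk != null) {
                _zk.close();
                _zk = null; // a readData() racing with close() now NPEs
            }
        }
    }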

The kafka process continued running after this point, as confirmed by the
log segments that kept rolling:
[ReplicaFetcherThread-3-9] [kafka.log.Log]: Rolled new log segment for 'topic-a-1' in 0 ms.
[ReplicaFetcherThread-0-12] [kafka.log.Log]: Rolled new log segment for 'topic-b-4' in 0 ms.

etc.

At this point, that broker was in a half-dead state. Our clients were
still timing out while enqueuing messages to it. The under-replicated
partition count on the other brokers was stuck at a constant positive value
and made no progress. We also noticed that the jmx connector threads
weren't responding, which is how we found out the process was in bad
shape. This lasted about 40 minutes, until we killed the process and
restarted it. Things recovered after the restart.
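
In case it helps others catch this state sooner: since the process was up and listening but not making progress, a liveness probe has to treat "hangs" as failure, not just "connection refused". The sketch below is our own idea, not anything kafka ships; the JMX URL is a placeholder, and the MBean name is what we believe 0.8.x registers via its Yammer metrics, so verify it against your broker:

    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;
    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    // Hypothetical half-dead-broker probe (our own sketch): a JMX read
    // that hangs past a deadline counts as a failure.
    public class BrokerJmxProbe {
        public static void main(String[] args) throws Exception {
            ExecutorService pool = Executors.newSingleThreadExecutor();
            Future<Object> value = pool.submit(new Callable<Object>() {
                public Object call() throws Exception {
                    JMXServiceURL url = new JMXServiceURL(
                        "service:jmx:rmi:///jndi/rmi://broker13.example.com:9999/jmxrmi");
                    JMXConnector connector = JMXConnectorFactory.connect(url);
                    try {
                        MBeanServerConnection mbsc = connector.getMBeanServerConnection();
                        return mbsc.getAttribute(new ObjectName(
                            "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions"),
                            "Value");
                    } finally {
                        connector.close();
                    }
                }
            });
            try {
                System.out.println("UnderReplicatedPartitions = "
                        + value.get(10, TimeUnit.SECONDS));
            } catch (TimeoutException e) {
                // The read hung: the broker answers TCP but does no work.
                System.err.println("JMX unresponsive; broker is likely half-dead");
                System.exit(1);
            } finally {
                pool.shutdownNow();
            }
        }
    }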

1. Is this a known kafka/zookeeper issue impacting the version we are
running? If not, I'd be happy to file a bug.
2. Since we had other healthy zookeeper instances (5 out of 7), is there a
reason why kafka/zkclient couldn't have handled this more gracefully? My
assumption was that the zookeeper client would keep picking other instances
from the list until a healthy node was found. It is odd that it gave up this
quickly.
3. On the kafka side, the server decided to shut down by itself, but the
shutdown should have been a clean one. The NPE should either have been
avoided, or caught and handled differently. It would be good to have some
clarification on when this self-shutdown is invoked and whether we could
have done things differently on our side to avoid it (for example,
restarting the broker after decommissioning some of our zookeeper nodes;
see the sketch after this list).
4. More generally, what do other kafka users do when recycling their
zookeeper clusters or replacing old machines with new ones?
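
For question 3 in particular, the workaround we are leaning towards (our own idea, nothing kafka or zkclient provides) is a pre-flight check run whenever zookeeper membership changes: verify that every host in the zookeeper.connect string still resolves before rolling brokers onto the new string. A rough sketch:

    import java.net.InetAddress;
    import java.net.UnknownHostException;

    // Hypothetical pre-flight check, not part of kafka: verify that every
    // host in a zookeeper.connect-style string still resolves in DNS.
    public class ZkConnectStringCheck {
        public static void main(String[] args) {
            String connect = "zookeeper21.example.com:2181,"
                    + "zookeeper16.example.com:2181,"
                    + "zookeeper15.example.com:2181/foo/kafka/central";
            // Strip the optional chroot suffix (/foo/kafka/central).
            int slash = connect.indexOf('/');
            String hostList = slash >= 0 ? connect.substring(0, slash) : connect;
            boolean ok = true;
            for (String hostPort : hostList.split(",")) {
                String host = hostPort.split(":")[0];
                try {
                    InetAddress.getAllByName(host);
                } catch (UnknownHostException e) {
                    System.err.println("unresolvable: " + host);
                    ok = false;
                }
            }
            // Non-zero exit fails the deploy before any broker restart.
            System.exit(ok ? 0 : 1);
        }
    }

Run as part of a decommission runbook, a check like this would have flagged zookeeper15 and zookeeper16 well before the session expiry did.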

Thanks.
--
Mahdi.

[*]https://gist.github.com/mahdibh/76b230e25a3f7113349e


