Braedon Vickers created KAFKA-3984:
--------------------------------------
Summary: Broker doesn't retry reconnecting to an expired Zookeeper connection
Key: KAFKA-3984
URL: https://issues.apache.org/jira/browse/KAFKA-3984
Project: Kafka
Issue Type: Bug
Affects Versions: 0.9.0.1
Reporter: Braedon Vickers
We've been having issues with the network connectivity of our Kafka cluster,
and this seems to be triggering an issue where the brokers stop trying to
reconnect to Zookeeper, leaving us with a broken cluster even when the network
has recovered.
When the network issues begin, we see {{java.net.NoRouteToHostException}} exceptions
from {{org.apache.zookeeper.ClientCnxn}} as it attempts to re-establish the
connection. If the network issue resolves itself while we are only getting
these errors, the broker seems to reconnect fine.
However, a lot of the time we end up with a message like this:
{code}
[2016-07-22 00:21:44,181] FATAL Could not establish session with zookeeper (kafka.server.KafkaHealthcheck)
org.I0Itec.zkclient.exception.ZkException: Unable to connect to <zookeeper hosts>
    at org.I0Itec.zkclient.ZkConnection.connect(ZkConnection.java:71)
    at org.I0Itec.zkclient.ZkClient.reconnect(ZkClient.java:1279)
    ...
Caused by: java.net.UnknownHostException: <zookeeper host>
    at java.net.InetAddress.getAllByName(InetAddress.java:1126)
    at java.net.InetAddress.getAllByName(InetAddress.java:1192)
    at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:61)
    at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445)
    ...
{code}
(Apologies for the partial stack traces; I'm having to reconstruct them from a
less-than-ideal centralised logging setup.)
If this happens, the broker stops trying to reconnect to Zookeeper, and we have
to restart it.
It looks like while the {{org.apache.zookeeper.ZooKeeper}} client's state isn't
{{Expired}} it will keep retrying the connection, and will recover OK when the
network is back. However, once it changes to {{Expired}} (not entirely sure how
that happens - based on the session timeout perhaps?) zkclient closes the
existing client and attempts to create a new one. If the network is still down,
the client constructor throws a {{java.net.UnknownHostException}} and zkclient
calls {{handleSessionEstablishmentError()}} on {{KafkaHealthcheck}}, which logs
a "Fatal" error and does nothing else.
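
To make that sequence concrete, here is a simplified model of the flow as I understand it. This is not the zkclient or Kafka source: {{ExpiredSessionModel}}, {{SessionState}}, {{ClientFactory}} and {{handleStateChange}} are hypothetical names, and the factory merely stands in for constructing a new ZooKeeper client.

```java
import java.net.UnknownHostException;
import java.util.List;

// Simplified, hypothetical model of the behaviour described in this report.
final class ExpiredSessionModel {
    enum SessionState { DISCONNECTED, EXPIRED }

    /** Hypothetical stand-in for constructing a new ZooKeeper client. */
    interface ClientFactory {
        void create() throws UnknownHostException;
    }

    /**
     * Returns true if something will still try to reach ZooKeeper afterwards,
     * false if the broker is left with no client and no retry.
     */
    static boolean handleStateChange(SessionState state,
                                     ClientFactory factory,
                                     List<String> log) {
        if (state != SessionState.EXPIRED) {
            // Not expired: the existing client keeps retrying on its own.
            log.add("existing client keeps retrying the connection");
            return true;
        }
        // Expired: the old client is closed and a new one is created.
        log.add("session expired: closing old client, creating a new one");
        try {
            factory.create();
            log.add("new session established");
            return true;
        } catch (UnknownHostException e) {
            // Host resolution failed in the constructor. The error is only
            // logged and nothing schedules another attempt.
            log.add("FATAL Could not establish session with zookeeper");
            return false;
        }
    }
}
```

In this model the only path that returns {{false}} is an expired session combined with a failing constructor, which matches the state our brokers get stuck in.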
It seems like some form of retry needs to happen here, or the broker is stuck
with no Zookeeper connection indefinitely.
{{KafkaHealthcheck.handleSessionEstablishmentError()}} used to
kill the JVM, but that was removed in
https://issues.apache.org/jira/browse/KAFKA-2405. Killing the JVM would be
better than doing nothing, as then your init system could restart it, allowing
it to recover once the network was back.
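
As a rough illustration of the kind of retry meant here, a minimal sketch follows. It is not Kafka's or zkclient's code: {{SessionReconnector}}, {{connectWithRetry}} and the backoff parameters are hypothetical, and the {{Callable}} stands in for whatever creates the new ZooKeeper client.

```java
import java.util.concurrent.Callable;

// Hypothetical helper: retries a connection attempt with exponential backoff.
final class SessionReconnector {
    /**
     * Calls connect until it succeeds or maxAttempts is reached, sleeping
     * between attempts. The sleep starts at initialBackoffMs and doubles
     * after each failure, capped at maxBackoffMs. If every attempt fails,
     * the last exception is rethrown so the caller can decide what to do.
     */
    static <T> T connectWithRetry(Callable<T> connect,
                                  int maxAttempts,
                                  long initialBackoffMs,
                                  long maxBackoffMs) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long backoffMs = initialBackoffMs;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return connect.call();
            } catch (InterruptedException e) {
                // Do not retry when interrupted; restore the flag and stop.
                Thread.currentThread().interrupt();
                throw e;
            } catch (Exception e) {
                // e.g. java.net.UnknownHostException while the network is down
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMs);
                    backoffMs = Math.min(backoffMs * 2, maxBackoffMs);
                }
            }
        }
        throw last;
    }
}
```

A caller that exhausts its attempts could still fall back to exiting the JVM, so that the init system restarts the broker rather than leaving it running without a Zookeeper connection.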