[jira] [Updated] (KAFKA-3984) Broker doesn't retry reconnecting to an expired Zookeeper connection

Braedon Vickers (JIRA) Thu, 21 Jul 2016 22:46:53 -0700

     [ 
https://issues.apache.org/jira/browse/KAFKA-3984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Braedon Vickers updated KAFKA-3984:
-----------------------------------
    Description: 
We've been having issues with the network connectivity of our Kafka cluster, 
and this seems to be triggering an issue where the brokers stop trying to 
reconnect to Zookeeper, leaving us with a broken cluster even when the network 
has recovered.

When network issues begin we see {{java.net.NoRouteToHostException}} exceptions 
from {{org.apache.zookeeper.ClientCnxn}} as it attempts to re-establish the 
connection. If the network issue resolves itself while we are only getting 
these errors the broker seems to reconnect fine.

However, a lot of the time we end up with a message like this:
{code}[2016-07-22 00:21:44,181] FATAL Could not establish session with zookeeper (kafka.server.KafkaHealthcheck)
org.I0Itec.zkclient.exception.ZkException: Unable to connect to <zookeeper hosts>
        at org.I0Itec.zkclient.ZkConnection.connect(ZkConnection.java:71)
        at org.I0Itec.zkclient.ZkClient.reconnect(ZkClient.java:1279)
...
Caused by: java.net.UnknownHostException: <zookeeper host>
        at java.net.InetAddress.getAllByName(InetAddress.java:1126)
        at java.net.InetAddress.getAllByName(InetAddress.java:1192)
        at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:61)
        at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445)
...
{code}
(apologies for the partial stack traces - I'm having to try and reconstruct 
them from a less than ideal centralised logging setup.)

If this happens, the broker stops trying to reconnect to Zookeeper, and we have 
to restart it.

It looks like while the {{org.apache.zookeeper.ZooKeeper}} client's state isn't 
{{Expired}} it will keep retrying the connection, and will recover OK when the 
network is back. However, once it changes to {{Expired}} (not entirely sure how 
that happens - based on the session timeout perhaps?) zkclient closes the 
existing client and attempts to create a new one. If the network is still down, 
the client constructor throws a {{java.net.UnknownHostException}}, zkclient 
calls {{handleSessionEstablishmentError()}} on {{KafkaHealthcheck}}, and 
{{KafkaHealthcheck.handleSessionEstablishmentError()}} logs a FATAL error and 
does nothing else.

It seems like some form of retry needs to happen here, or the broker is stuck 
with no Zookeeper connection indefinitely. 
{{KafkaHealthcheck.handleSessionEstablishmentError()}} used to kill the JVM, 
but that was removed in https://issues.apache.org/jira/browse/KAFKA-2405. 
Killing the JVM would be better than doing nothing, as the init system could 
then restart the broker, allowing it to recover once the network was back.
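
To make the suggestion concrete, here is a minimal sketch of the kind of retry 
I have in mind. It is not Kafka or zkclient code: {{SessionRetry}}, 
{{Result}} and {{connectWithRetry}} are hypothetical names, and the 
{{factory}} argument stands in for whatever creates the new Zookeeper client. 
It retries with capped exponential backoff and reports the last failure to the 
caller instead of swallowing it.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper; nothing here exists in Kafka or zkclient. */
final class SessionRetry {
    private SessionRetry() {}

    /** Outcome of the retry loop: a session, or the last failure seen. */
    static final class Result<T> {
        final T session;           // null when every attempt failed
        final Exception lastError; // null on success
        final int attempts;        // number of factory calls made

        Result(T session, Exception lastError, int attempts) {
            this.session = session;
            this.lastError = lastError;
            this.attempts = attempts;
        }

        boolean connected() {
            return lastError == null;
        }
    }

    /**
     * Calls the factory until it succeeds or maxAttempts is reached,
     * sleeping between attempts and doubling the delay up to maxBackoffMs.
     */
    static <T> Result<T> connectWithRetry(Callable<T> factory,
                                          int maxAttempts,
                                          long initialBackoffMs,
                                          long maxBackoffMs) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long backoffMs = initialBackoffMs;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return new Result<>(factory.call(), null, attempt);
            } catch (Exception e) {
                // e.g. the UnknownHostException thrown while DNS is unreachable
                last = e;
            }
            if (attempt == maxAttempts) {
                break;
            }
            try {
                Thread.sleep(backoffMs);
            } catch (InterruptedException e) {
                // Preserve the interrupt and stop retrying.
                Thread.currentThread().interrupt();
                return new Result<>(null, e, attempt);
            }
            backoffMs = Math.min(backoffMs * 2, maxBackoffMs);
        }
        return new Result<>(null, last, maxAttempts);
    }
}
```

If {{connected()}} is still false after the last attempt, the caller could 
fall back to exiting the JVM so that the init system restarts the broker, 
which would combine the retry with the behaviour that existed before 
KAFKA-2405.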

Our cluster is running 0.9.0.1, so I'm not sure whether this affects 0.10.0.0 
as well. However, it seems likely, as there don't seem to be any code changes 
in kafka or zkclient that would affect this behaviour.

  was:
We've been having issues with the network connectivity of our Kafka cluster, 
and this seems to be triggering an issue where the brokers stop trying to 
reconnect to Zookeeper, leaving us with a broken cluster even when the network 
has recovered.

When network issues begin we see {{java.net.NoRouteToHostException}} exceptions 
from {{org.apache.zookeeper.ClientCnxn}} as it attempts to re-establish the 
connection. If the network issue resolves itself while we are only getting 
these errors the broker seems to reconnect fine.

However, a lot of the time we end up with a message like this:
{code}[2016-07-22 00:21:44,181] FATAL Could not establish session with 
zookeeper (kafka.server.KafkaHealthcheck)
org.I0Itec.zkclient.exception.ZkException: Unable to connect to <zookeeper 
hosts>
        at org.I0Itec.zkclient.ZkConnection.connect(ZkConnection.java:71)
        at org.I0Itec.zkclient.ZkClient.reconnect(ZkClient.java:1279)
...
Caused by: java.net.UnknownHostException: <zookeeper host>
        at java.net.InetAddress.getAllByName(InetAddress.java:1126)
        at java.net.InetAddress.getAllByName(InetAddress.java:1192)
        at 
org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:61)
        at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445)
...
{code}
(apologies for the partial stack traces - I'm having to try and reconstruct 
them from a less than ideal centralised logging setup.)

If this happens, the broker stops trying to reconnect to Zookeeper, and we have 
to restart it.

It looks like while the {{org.apache.zookeeper.Zookeeper}} client's state isn't 
{{Expired}} it will keep retrying the connection, and will recover OK when the 
network is back. However, once it changes to {{Expired}} (not entirely sure how 
that happens - based on the session timeout perhaps?) zkclient closes the 
existing client and attempts to create a new one. If the network is still down, 
the client constructor throws a {{java.net.UnknownHostException}}, zkclient 
calls {{handleSessionEstablishmentError()}} on {{KafkaHealthcheck}}, 
{{KafkaHealthcheck.handleSessionEstablishmentError()}} logs a "Fatal" error and 
does nothing else.

It seems like some form of retry needs to happen here, or the broker is stuck 
with no Zookeeper connection 
indefinitely.{{KafkaHealthcheck.handleSessionEstablishmentError()}} used to 
kill the JVM, but that was removed in 
https://issues.apache.org/jira/browse/KAFKA-2405. Killing the JVM would be 
better than doing nothing, as then your init system could restart it, allowing 
it to recover once the network was back.


> Broker doesn't retry reconnecting to an expired Zookeeper connection
> --------------------------------------------------------------------
>
>                 Key: KAFKA-3984
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3984
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.9.0.1
>            Reporter: Braedon Vickers
>
> We've been having issues with the network connectivity of our Kafka cluster, 
> and this seems to be triggering an issue where the brokers stop trying to 
> reconnect to Zookeeper, leaving us with a broken cluster even when the 
> network has recovered.
> When network issues begin we see {{java.net.NoRouteToHostException}} 
> exceptions from {{org.apache.zookeeper.ClientCnxn}} as it attempts to 
> re-establish the connection. If the network issue resolves itself while we 
> are only getting these errors the broker seems to reconnect fine.
> However, a lot of the time we end up with a message like this:
> {code}[2016-07-22 00:21:44,181] FATAL Could not establish session with zookeeper (kafka.server.KafkaHealthcheck)
> org.I0Itec.zkclient.exception.ZkException: Unable to connect to <zookeeper hosts>
>       at org.I0Itec.zkclient.ZkConnection.connect(ZkConnection.java:71)
>       at org.I0Itec.zkclient.ZkClient.reconnect(ZkClient.java:1279)
> ...
> Caused by: java.net.UnknownHostException: <zookeeper host>
>       at java.net.InetAddress.getAllByName(InetAddress.java:1126)
>       at java.net.InetAddress.getAllByName(InetAddress.java:1192)
>       at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:61)
>       at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445)
> ...
> {code}
> (apologies for the partial stack traces - I'm having to try and reconstruct 
> them from a less than ideal centralised logging setup.)
> If this happens, the broker stops trying to reconnect to Zookeeper, and we 
> have to restart it.
> It looks like while the {{org.apache.zookeeper.ZooKeeper}} client's state 
> isn't {{Expired}} it will keep retrying the connection, and will recover OK 
> when the network is back. However, once it changes to {{Expired}} (not 
> entirely sure how that happens - based on the session timeout perhaps?) 
> zkclient closes the existing client and attempts to create a new one. If the 
> network is still down, the client constructor throws a 
> {{java.net.UnknownHostException}}, zkclient calls 
> {{handleSessionEstablishmentError()}} on {{KafkaHealthcheck}}, and 
> {{KafkaHealthcheck.handleSessionEstablishmentError()}} logs a FATAL error 
> and does nothing else.
> It seems like some form of retry needs to happen here, or the broker is stuck 
> with no Zookeeper connection indefinitely. 
> {{KafkaHealthcheck.handleSessionEstablishmentError()}} used to kill the JVM, 
> but that was removed in https://issues.apache.org/jira/browse/KAFKA-2405. 
> Killing the JVM would be better than doing nothing, as the init system could 
> then restart the broker, allowing it to recover once the network was back.
> Our cluster is running 0.9.0.1, so I'm not sure whether this affects 
> 0.10.0.0 as well. However, it seems likely, as there don't seem to be any 
> code changes in kafka or zkclient that would affect this behaviour.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
