Eric Ward created KAFKA-9957:
--------------------------------

             Summary: Kafka Controller doesn't failover during hardware failure
                 Key: KAFKA-9957
                 URL: https://issues.apache.org/jira/browse/KAFKA-9957
             Project: Kafka
          Issue Type: Bug
          Components: controller
    Affects Versions: 2.5.0, 2.2.0
            Reporter: Eric Ward
         Attachments: kafka-threaddump.out

In a couple of different production environments we've run into an issue where a 
hardware failure hangs the controller and prevents controller and topic 
leadership from failing over to a healthy broker.  When the issue happens we see 
the following repeated at regular intervals in the logs of the other brokers (the 
affected broker can't write to disk, so no logging occurs there):
{noformat}
[2020-04-26 01:12:30,613] WARN [ReplicaFetcher replicaId=0, leaderId=2, fetcherId=0] Error in response for fetch request (type=FetchRequest, replicaId=0, maxWait=500, minBytes=1, maxBytes=10485760, fetchData={*snip*}, isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=1962806970, epoch=INITIAL)) (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 2 was disconnected before the response was read
        at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:100)
        at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:100)
        at kafka.server.ReplicaFetcherThread.fetchFromLeader(ReplicaFetcherThread.scala:193)
        at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:280)
        at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3(AbstractFetcherThread.scala:132)
        at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3$adapted(AbstractFetcherThread.scala:131)
        at scala.Option.foreach(Option.scala:274)
        at kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:131)
        at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:113)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)

{noformat}
This issue appears to be similar to KAFKA-7870, though that issue was 
purportedly fixed by KAFKA-7697.

Once we encounter this error, any partitions whose leadership is on the affected 
node remain unavailable until we force that broker out of the cluster (that is, 
kill the node).

When we initially hit the issue we were running version 2.2.0, though I've been 
able to reproduce it in an environment running 2.5.0 as well.  To simulate the 
hardware failure I'm using the xfs_freeze utility to suspend access to the 
filesystem.  Zookeeper failover is also part of the mix: in all instances where 
we've seen this, the ZK leader and the Kafka Controller were on the same node and 
both were affected by the hardware issue.  Zookeeper is able to fail over 
successfully, and it does so rather quickly.

Reproduction steps are pretty straightforward:
 # Spin up a 3 node cluster
 # Ensure that the Kafka Controller and Zookeeper Leader are on the same node.
 # xfs_freeze the filesystem on the node that the controller is running on (see the sketch below)
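
For reference, here is a minimal sketch of how the reproduction can be driven from a shell. The mount point (/kafka-data) and the Zookeeper address (zk-host:2181) are placeholders, not values from the environments above:
{noformat}
# Find which broker currently holds the controller role ("brokerid" in the JSON output)
bin/zookeeper-shell.sh zk-host:2181 get /controller

# Simulate the hardware failure by suspending all writes on the controller's data filesystem
sudo xfs_freeze -f /kafka-data

# ...no controller failover happens; partitions led by this broker stay unavailable...

# Thaw the filesystem to let the cluster heal
sudo xfs_freeze -u /kafka-data
{noformat}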

This reproduces 100% of the time for me.  I’ve left it running for well over an 
hour without any Kafka failover happening.  Unfreezing the node will allow the 
cluster to heal itself.

I’ve attached a thread dump from an environment running 2.5.0.


