[ https://issues.apache.org/jira/browse/KAFKA-9957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17100018#comment-17100018 ]
Ismael Juma commented on KAFKA-9957:
------------------------------------

Can you configure a timeout at the OS level? For example, see the following documentation from Amazon:
{quote}
Most operating systems specify a timeout for I/O operations submitted to NVMe devices. The default timeout is 30 seconds and can be changed using the {{nvme_core.io_timeout}} boot parameter. With Linux kernels earlier than version 4.6, this parameter is {{nvme.io_timeout}}.
{quote}
[https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/nvme-ebs-volumes.html]

(A sketch of setting this boot parameter follows the quoted issue below.)

> Kafka Controller doesn't failover during hardware failure
> ---------------------------------------------------------
>
>                 Key: KAFKA-9957
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9957
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller
>    Affects Versions: 2.2.0, 2.5.0
>            Reporter: Eric Ward
>            Priority: Critical
>         Attachments: kafka-threaddump.out
>
> On a couple of different production environments we've run into an issue where a hardware failure has hung the controller and prevented controller and topic leadership from moving to a healthy broker. When the issue happens we see the following repeated in the logs at regular intervals on the other brokers (the affected broker can't write to disk, so no logging occurs there):
> {noformat}
> [2020-04-26 01:12:30,613] WARN [ReplicaFetcher replicaId=0, leaderId=2, fetcherId=0] Error in response for fetch request (type=FetchRequest, replicaId=0, maxWait=500, minBytes=1, maxBytes=10485760, fetchData={*snip*}, isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=1962806970, epoch=INITIAL)) (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 2 was disconnected before the response was read
>     at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:100)
>     at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:100)
>     at kafka.server.ReplicaFetcherThread.fetchFromLeader(ReplicaFetcherThread.scala:193)
>     at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:280)
>     at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3(AbstractFetcherThread.scala:132)
>     at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3$adapted(AbstractFetcherThread.scala:131)
>     at scala.Option.foreach(Option.scala:274)
>     at kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:131)
>     at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:113)
>     at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
> {noformat}
> This issue appears to be similar to KAFKA-7870, though that issue was purportedly fixed by KAFKA-7697.
> Once we encounter this error, any partitions whose leadership is on the affected node are unavailable until we force that broker out of the cluster – that is to say, kill the node.
> When we initially hit the issue we were running version 2.2.0, though I've been able to reproduce it in an environment running 2.5.0 as well. To simulate the hardware failure I'm using the xfs_freeze utility to suspend access to the filesystem. Zookeeper failover is also part of the mix. In all instances where we've seen this, the ZK leader and the Kafka Controller were on the same node and both were affected by the hardware issue. Zookeeper is able to fail over successfully, which it does rather quickly.
> Reproduction steps are pretty straightforward:
> # Spin up a 3-node cluster.
> # Ensure that the Kafka Controller and the Zookeeper leader are on the same node.
> # xfs_freeze the filesystem on the node that the controller is running on (example commands are sketched below).
> This reproduces 100% of the time for me. I've left it running for well over an hour without any Kafka failover happening. Unfreezing the node will allow the cluster to heal itself.
> I've attached a thread dump from an environment running 2.5.0.
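For anyone else reproducing this, here is a minimal sketch of the freeze/unfreeze step above. It assumes the broker's {{log.dirs}} sits on an XFS mount at {{/var/lib/kafka}}; that mount point is hypothetical, so substitute the filesystem your broker actually writes to:
{noformat}
# Simulate the hardware failure: block all writes to the filesystem
# backing the broker's log.dirs (hypothetical mount point).
sudo xfs_freeze -f /var/lib/kafka

# ...observe whether controller and partition leadership move to a healthy broker...

# Undo the freeze; per the report above, the cluster heals itself afterwards.
sudo xfs_freeze -u /var/lib/kafka
{noformat}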
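And a minimal sketch of the OS-level timeout suggested in the comment at the top, assuming a GRUB-based Linux distribution on kernel 4.6 or newer (the 30-second value and file paths are illustrative; see the AWS page linked above for the values appropriate to your setup):
{noformat}
# Check the current NVMe I/O timeout, in seconds.
cat /sys/module/nvme_core/parameters/io_timeout

# Make the timeout persistent by adding the boot parameter to GRUB:
# in /etc/default/grub, append to the existing GRUB_CMDLINE_LINUX line, e.g.
#   GRUB_CMDLINE_LINUX="... nvme_core.io_timeout=30"
# (use nvme.io_timeout instead on kernels older than 4.6)

# Regenerate the GRUB configuration and reboot:
sudo update-grub                               # Debian/Ubuntu
sudo grub2-mkconfig -o /boot/grub2/grub.cfg    # RHEL/Amazon Linux
sudo reboot
{noformat}
As I read the suggestion, the point of a bounded I/O timeout is to turn an indefinitely hung write into an error the broker can observe, rather than a thread that blocks forever.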