[ https://issues.apache.org/jira/browse/KAFKA-9957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17100018#comment-17100018 ]

Ismael Juma commented on KAFKA-9957:
------------------------------------

Can you configure a timeout at the OS level? For example, see the following 
documentation from Amazon:
{quote}Most operating systems specify a timeout for I/O operations submitted to 
NVMe devices. The default timeout is 30 seconds and can be changed using the 
{{nvme_core.io_timeout}} boot parameter. With Linux kernels earlier than 
version 4.6, this parameter is {{nvme.io_timeout}}.
{quote}
[https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/nvme-ebs-volumes.html]
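
For reference, a rough sketch of how the timeout could be checked and changed, 
assuming a Linux kernel >= 4.6 and a Debian/Ubuntu-style GRUB setup (255 below 
is just an example value, not a recommendation):
{noformat}
# current NVMe I/O timeout in seconds (default is 30)
cat /sys/module/nvme_core/parameters/io_timeout

# to persist a larger timeout, append the boot parameter to the kernel
# command line, e.g. GRUB_CMDLINE_LINUX="... nvme_core.io_timeout=255"
# in /etc/default/grub, then regenerate the GRUB config and reboot
sudo update-grub
sudo reboot
{noformat}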

> Kafka Controller doesn't failover during hardware failure
> ---------------------------------------------------------
>
>                 Key: KAFKA-9957
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9957
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller
>    Affects Versions: 2.2.0, 2.5.0
>            Reporter: Eric Ward
>            Priority: Critical
>         Attachments: kafka-threaddump.out
>
>
> In a couple of different production environments we've run into an issue where 
> a hardware failure hangs the controller and prevents controller and topic 
> leadership from moving to a healthy broker.  When the issue happens we see the 
> following repeated at regular intervals in the logs of the other brokers (the 
> affected broker can't write to disk, so no logging occurs there):
> {noformat}
> [2020-04-26 01:12:30,613] WARN [ReplicaFetcher replicaId=0, leaderId=2, fetcherId=0] Error in response for fetch request (type=FetchRequest, replicaId=0, maxWait=500, minBytes=1, maxBytes=10485760, fetchData={*snip*}, isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=1962806970, epoch=INITIAL)) (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 2 was disconnected before the response was read
>       at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:100)
>       at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:100)
>       at kafka.server.ReplicaFetcherThread.fetchFromLeader(ReplicaFetcherThread.scala:193)
>       at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:280)
>       at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3(AbstractFetcherThread.scala:132)
>       at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3$adapted(AbstractFetcherThread.scala:131)
>       at scala.Option.foreach(Option.scala:274)
>       at kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:131)
>       at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:113)
>       at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
> {noformat}
> This issue appears to be similar to KAFKA-7870, though that issue was 
> purportedly fixed by KAFKA-7697.
> Once we encounter this error, any partitions whose leadership is on the 
> affected node are unavailable until we force that broker out of the cluster 
> (that is, kill the node).
> When we initially hit the issue we were running version 2.2.0, though I've 
> been able to reproduce it in an environment running 2.5.0 as well. To 
> simulate the hardware failure I'm using the xfs_freeze utility to suspend 
> access to the filesystem.  Zookeeper failover is also part of the mix: in 
> all instances where we've seen this, the ZK leader and the Kafka Controller 
> were on the same node and both were affected by the hardware issue.  Zookeeper 
> is able to fail over successfully, which it does rather quickly.
> Reproduction steps are pretty straightforward:
>  # Spin up a 3 node cluster
>  # Ensure that the Kafka Controller and Zookeeper Leader are on the same node.
>  # xfs_freeze the filesystem on the node that the controller is running on 
> (see the command sketch below)
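> A rough sketch of step 3, assuming the Kafka data directory sits on its own 
> XFS mount at /data/kafka (a placeholder path; substitute the actual mount 
> point for the broker's log.dirs volume):
> {noformat}
> # suspend all I/O to the filesystem backing the Kafka data directory
> sudo xfs_freeze -f /data/kafka
>
> # ...controller and partition leadership never move to a healthy broker...
>
> # resume I/O; the cluster recovers shortly afterwards
> sudo xfs_freeze -u /data/kafka
> {noformat}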
> This reproduces 100% of the time for me.  I’ve left it running for well over 
> an hour without any Kafka failover happening.  Unfreezing the node will allow 
> the cluster to heal itself.
> I’ve attached a thread dump from an environment running 2.5.0.



