[jira] [Commented] (KAFKA-9957) Kafka Controller doesn't failover during hardware failure

Eric Ward (Jira) Tue, 05 May 2020 17:38:13 -0700


    [ 
https://issues.apache.org/jira/browse/KAFKA-9957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17100359#comment-17100359
 ]


Eric Ward commented on KAFKA-9957:
----------------------------------

We are using the default configuration (30 seconds) and it doesn't seem to have 
any effect with regard to the xfs_freeze utility.  The same settings are also 
in place in our production environments, and disk I/O was similarly unaffected 
by this setting.

However, during the production incident we were seeing the following:
{noformat}
[root@aprod-1 log]# grep "I/O" messages
Apr 25 16:01:11 aprod-1 kernel: [2172735.297527] nvme nvme4: I/O 0 QID 1 timeout, aborting
Apr 25 16:01:11 aprod-1 kernel: [2172735.298766] nvme nvme4: I/O 1 QID 1 timeout, aborting
Apr 25 16:01:11 aprod-1 kernel: [2172735.300008] nvme nvme4: I/O 2 QID 1 timeout, aborting
Apr 25 16:01:11 aprod-1 kernel: [2172735.301301] nvme nvme4: I/O 25 QID 1 timeout, aborting
Apr 25 16:01:11 aprod-1 kernel: [2172735.302645] nvme nvme4: I/O 26 QID 1 timeout, aborting
Apr 25 19:34:58 aprod-1 kernel: [2185562.190076] nvme nvme4: I/O 11 QID 2 timeout, aborting
Apr 25 19:34:58 aprod-1 kernel: [2185562.191395] nvme nvme4: I/O 12 QID 2 timeout, aborting
Apr 25 19:34:58 aprod-1 kernel: [2185562.192741] nvme nvme4: I/O 13 QID 2 timeout, aborting
Apr 25 19:34:59 aprod-1 kernel: [2185563.190079] nvme nvme4: I/O 14 QID 2 timeout, aborting
Apr 25 19:35:00 aprod-1 kernel: [2185564.190068] nvme nvme4: I/O 15 QID 2 timeout, aborting
Apr 25 19:35:29 aprod-1 kernel: [2185593.189852] nvme nvme4: I/O 11 QID 2 timeout, reset controller
Apr 25 19:35:29 aprod-1 kernel: [2185593.306412] blk_update_request: I/O error, dev nvme4n1, sector 39955384
Apr 25 19:35:29 aprod-1 kernel: [2185593.307980] blk_update_request: I/O error, dev nvme4n1, sector 118072528
Apr 25 19:35:29 aprod-1 kernel: [2185593.309592] blk_update_request: I/O error, dev nvme4n1, sector 1338160
Apr 25 19:36:00 aprod-1 kernel: [2185624.197631] nvme nvme4: I/O 7 QID 1 timeout, disable controller
{noformat}
This suggests that the OS-level timeout was being hit, but that it was not 
translating into any sort of I/O exception at the application layer. 
Unfortunately, I don't have much more insight into what was happening at the 
hardware level, other than that EBS in that AZ was having some sort of issue 
at the time, reported as "Some of the EBS volumes attached to your instances 
are operating with degraded performance".
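
To illustrate the gap described above (I/O that hangs rather than fails, so 
the application never sees an exception), here is a minimal sketch of an 
out-of-band disk probe. It is not anything Kafka does today; the function 
name probe_disk, the write_fn hook and the timeout values are all 
hypothetical. The idea is to run a small write+fsync in a helper thread and 
treat "did not finish in time" as unhealthy, since a blocked write on a frozen 
or stalled filesystem is expected to simply never return.

```python
import os
import tempfile
import threading


def probe_disk(directory, timeout_s=5.0, write_fn=None):
    """Return True only if a small write+fsync finishes within timeout_s.

    write_fn is a hypothetical hook that replaces the real write, so the
    stall case can be simulated without freezing a filesystem.
    """
    def default_write():
        # Create, write, fsync and remove a tiny file in the target directory.
        fd, path = tempfile.mkstemp(dir=directory, prefix=".probe-")
        try:
            os.write(fd, b"ok")
            os.fsync(fd)
        finally:
            os.close(fd)
            os.unlink(path)

    write = write_fn or default_write
    done = threading.Event()
    result = {"ok": False}

    def worker():
        try:
            write()
            result["ok"] = True
        except OSError:
            # A real I/O error surfaced; report the disk as unhealthy.
            result["ok"] = False
        finally:
            done.set()

    # Daemon thread: a write stuck in the kernel cannot be cancelled from
    # Python, so the thread is abandoned rather than joined.
    threading.Thread(target=worker, daemon=True).start()

    finished = done.wait(timeout_s)
    return finished and result["ok"]
```

A caller could run something like probe_disk("/path/to/log.dir", timeout_s=5.0) 
periodically and act on a False result; the path here is a placeholder. Note 
that the stuck helper thread stays blocked for as long as the I/O does, so 
this only detects the hang, it does not recover from it.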

> Kafka Controller doesn't failover during hardware failure
> ---------------------------------------------------------
>
>                 Key: KAFKA-9957
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9957
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller
>    Affects Versions: 2.2.0, 2.5.0
>            Reporter: Eric Ward
>            Priority: Critical
>         Attachments: kafka-threaddump.out
>
>
> On a couple different production environments we've run into an issue where a 
> hardware failure has hung up the controller and prevented controller and 
> topic leadership from changing to a healthy broker.  When the issue happens 
> we see this repeated in the logs at regular intervals for the other brokers 
> (the affected broker can’t write to disk, so no logging occurs there):
> {noformat}
> [2020-04-26 01:12:30,613] WARN [ReplicaFetcher replicaId=0, leaderId=2, fetcherId=0] Error in response for fetch request (type=FetchRequest, replicaId=0, maxWait=500, minBytes=1, maxBytes=10485760, fetchData={*snip*}, isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=1962806970, epoch=INITIAL)) (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 2 was disconnected before the response was read
>       at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:100)
>       at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:100)
>       at kafka.server.ReplicaFetcherThread.fetchFromLeader(ReplicaFetcherThread.scala:193)
>       at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:280)
>       at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3(AbstractFetcherThread.scala:132)
>       at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3$adapted(AbstractFetcherThread.scala:131)
>       at scala.Option.foreach(Option.scala:274)
>       at kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:131)
>       at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:113)
>       at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
> {noformat}
> This issue appears to be similar to KAFKA-7870, though that issue was 
> purportedly fixed by KAFKA-7697.
> Once we encounter this error any partitions whose leadership is on the 
> affected node are unavailable until we force that broker out of the cluster – 
> that is to say, kill the node.
> When we initially hit the issue we were running on version 2.2.0, though I've 
> been able to reproduce this in an environment running 2.5.0 as well. To 
> simulate the hardware failure I'm using the xfs_freeze utility to suspend 
> access to the filesystem.  Zookeeper failover is also part of the mix.  In 
> all instances where we’ve seen this the ZK leader and Kafka Controller were 
> on the same node and both affected by the hardware issue.  Zookeeper is able 
> to fail over successfully, which it does rather quickly.
> Reproduction steps are pretty straightforward:
>  # Spin up a 3 node cluster
>  # Ensure that the Kafka Controller and Zookeeper Leader are on the same node.
>  # xfs_freeze the filesystem on the node that the controller is running on
> This reproduces 100% of the time for me.  I’ve left it running for well over 
> an hour without any Kafka failover happening.  Unfreezing the node will allow 
> the cluster to heal itself.
> I’ve attached a thread dump from an environment running 2.5.0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
