[ 
https://issues.apache.org/jira/browse/KAFKA-15688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17780399#comment-17780399
 ] 

Peter Sinoros-Szabo commented on KAFKA-15688:
---------------------------------------------

> 2) Deploy another process to watch disk health and let it kill Kafka on disk 
> hung
We actually implemented something similar. When disk health degrades, we move 
the preferred leaders away from the bad broker and trigger a leader election. 
This seems much safer than stopping the broker. It works fine for us because we 
use min.insync.replicas=2 and the followers did not fall out of the ISR for 
10-20 minutes, so we had plenty of time to move the leaders.
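
For reference, here is a minimal sketch of what that evacuation step could look 
like with the Java AdminClient. This is illustration only, not our actual 
tooling; the topic, partition and broker IDs are made up.

{code:java}
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewPartitionReassignment;
import org.apache.kafka.common.ElectionType;
import org.apache.kafka.common.TopicPartition;

import java.util.*;

public class LeaderEvacuation {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-2:9092");

        // Hypothetical example: partition my-topic-0 has replicas [1, 2, 3]
        // and broker 1 (the one with the sick disk) is the preferred leader.
        TopicPartition tp = new TopicPartition("my-topic", 0);

        try (Admin admin = Admin.create(props)) {
            // 1) Reorder the replica list so a healthy broker comes first,
            //    making it the new preferred leader. Same replica set, so no
            //    data actually has to move.
            Map<TopicPartition, Optional<NewPartitionReassignment>> reassignment =
                Collections.singletonMap(tp,
                    Optional.of(new NewPartitionReassignment(Arrays.asList(2, 3, 1))));
            admin.alterPartitionReassignments(reassignment).all().get();

            // 2) Trigger a preferred leader election so leadership moves off
            //    the bad broker while it is still in the ISR.
            admin.electLeaders(ElectionType.PREFERRED, Collections.singleton(tp))
                 .partitions().get();
        }
    }
}
{code}

Because the reassignment only reorders the replica list, the election just 
shifts leadership to a replica that is already in sync.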

KIP-966 seems interesting, but it doesn't apply to us in this case: we didn't 
have brokers going down or falling out of the ISR, the leader simply couldn't 
accept more messages because it couldn't persist them. I assume the Producers 
got the exceptions because the leader was blocked either on those blocking 
fsync calls (pull/14242) or on a page cache so full of dirty pages that it 
couldn't be written out to disk.

(But when AWS had an AZ outage, I think we had practically the same case as 
described in KIP-966, so it would be awesome to have that feature.)

 

I hope the patch [pull/14242|https://github.com/apache/kafka/pull/14242] will 
be completed soon; that seems like a nice change as well.

 

For now, our problem is mitigated by the aforementioned process watching disk 
health, but it would be great if Kafka could detect this on its own. I 
understand the challenges with that, though.
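
The watchdog itself doesn't need to be fancy. Below is a rough, hypothetical 
sketch (not our production code) of a probe that writes and fsyncs a small 
file on the data volume and treats a timeout as a hung disk; the probe path, 
timeout and interval are made up for illustration.

{code:java}
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.concurrent.*;

public class DiskWatchdog {
    // Hypothetical values: a probe file on the same EBS volume as the Kafka
    // log dirs, and a timeout well below the ISR shrink window we observed.
    private static final Path PROBE = Paths.get("/var/lib/kafka/data/.disk-probe");
    private static final long TIMEOUT_SECONDS = 30;

    public static void main(String[] args) {
        ExecutorService io = Executors.newSingleThreadExecutor();
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

        scheduler.scheduleAtFixedRate(() -> {
            Future<?> probe = io.submit(() -> writeAndFsync());
            try {
                probe.get(TIMEOUT_SECONDS, TimeUnit.SECONDS);
            } catch (TimeoutException e) {
                probe.cancel(true);
                // Disk IO appears hung: this is where we would kick off the
                // preferred-leader change and election sketched earlier.
                System.err.println("Disk probe did not finish in " + TIMEOUT_SECONDS + "s");
            } catch (Exception e) {
                System.err.println("Disk probe failed: " + e);
            }
        }, 0, 60, TimeUnit.SECONDS);
    }

    private static void writeAndFsync() {
        try (FileChannel ch = FileChannel.open(PROBE,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            ch.write(ByteBuffer.wrap(Long.toString(System.nanoTime())
                    .getBytes(StandardCharsets.UTF_8)));
            // fsync so the probe actually hits the device, not just the page cache
            ch.force(true);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
{code}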

> Partition leader election not running when disk IO hangs
> --------------------------------------------------------
>
>                 Key: KAFKA-15688
>                 URL: https://issues.apache.org/jira/browse/KAFKA-15688
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 3.3.2
>            Reporter: Peter Sinoros-Szabo
>            Priority: Major
>
> We run our Kafka brokers on AWS EC2 nodes using AWS EBS as disk to store the 
> messages.
> Recently we had an issue where the EBS disk IO just stalled, so Kafka was not 
> able to write or read anything from the disk, except the data that was still 
> in the page cache or that still fit into the page cache before it was synced 
> to EBS.
> We experienced this issue in a few cases: sometimes partition leaders were 
> moved away to other brokers automatically; in other cases that didn't happen, 
> which caused the Producers to fail to produce messages to that broker.
> My expectation of Kafka in such a case would be that it notices the problem 
> and moves the leaders to other brokers where the partition has in-sync 
> replicas, but as I mentioned, this didn't always happen.
> I know Kafka will shut itself down if it can't write to its disk; that might 
> be a good solution in this case as well, since it would trigger the leader 
> election automatically.
> Is it possible to add such a feature to Kafka so that it shuts down in this 
> case as well?
> I guess a similar issue might happen with other disk subsystems too, or even 
> with a broken or slow disk.
> This scenario can be easily reproduced using AWS FIS.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
