[ 
https://issues.apache.org/jira/browse/KAFKA-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16745672#comment-16745672
 ] 

Jun Rao commented on KAFKA-7836:
--------------------------------

[~lindong], it seems that we could call zkClient.propagateLogDirEvent after the 
relevant partitions are marked offline, but before 
logManager.handleLogDirFailure, to speed up the propagation of log dir failure 
to the controller. Do you see any issue with that? Thanks.
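
To make the proposed ordering concrete, here is a rough sketch. It is not the actual ReplicaManager code: the class name, the boolean flag, and the recorded step names are hypothetical stand-ins that only trace the order in which the three steps would run under the current and the proposed sequence.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for illustration only; none of this is Kafka's real API.
class LogDirFailureSketch {

    // Returns the order in which the steps run.
    // notifyFirst = false models the current order, true models the proposed one.
    static List<String> handleLogDirFailure(boolean notifyFirst) {
        List<String> trace = new ArrayList<>();

        // Each step only records its name; the real work is elided.
        Runnable markOffline = () -> trace.add("markPartitionsOffline");
        Runnable notifyController = () -> trace.add("zkClient.propagateLogDirEvent");
        Runnable closeHandles = () -> trace.add("logManager.handleLogDirFailure");

        // In both orderings the partitions are marked offline first.
        markOffline.run();

        if (notifyFirst) {
            // Proposed: tell the controller before the potentially slow close.
            notifyController.run();
            closeHandles.run();
        } else {
            // Current: the controller is notified only after the handles are closed.
            closeHandles.run();
            notifyController.run();
        }
        return trace;
    }

    public static void main(String[] args) {
        System.out.println("current:  " + handleLogDirFailure(false));
        System.out.println("proposed: " + handleLogDirFailure(true));
    }
}
```

With the proposed order, a slow close on a bad disk would no longer sit between marking the partitions offline and notifying the controller.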

> The propagation of log dir failure can be delayed due to slowness in closing 
> the file handles
> ---------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-7836
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7836
>             Project: Kafka
>          Issue Type: Improvement
>            Reporter: Jun Rao
>            Priority: Major
>
> In ReplicaManager.handleLogDirFailure(), we call 
> zkClient.propagateLogDirEvent after logManager.handleLogDirFailure. The 
> latter closes the file handles of the offline replicas, which can take a long 
> time when the disk is bad. This delays the new leader election by the 
> controller. In one incident, closing the file handles of multiple replicas 
> took more than 20 seconds.



