[jira] [Commented] (KAFKA-7836) The propagation of log dir failure can be delayed due to slowness in closing the file handles

Jun Rao (JIRA) Thu, 17 Jan 2019 16:56:37 -0800


    [ https://issues.apache.org/jira/browse/KAFKA-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16745672#comment-16745672 ]


Jun Rao commented on KAFKA-7836:
--------------------------------

[~lindong], it seems that we could call zkClient.propagateLogDirEvent after the
relevant partitions are marked offline, but before
logManager.handleLogDirFailure, to speed up the propagation of the log dir
failure to the controller. Do you see any issue with that? Thanks.
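A minimal sketch of the reordering proposed above, written in Python rather than Kafka's Scala and using hypothetical callbacks: mark_partitions_offline, propagate_log_dir_event and close_file_handles are illustrative stand-ins for marking the partitions offline, zkClient.propagateLogDirEvent and logManager.handleLogDirFailure. It models time with a simulated clock instead of real disk I/O, and only shows how the call order affects when the controller is notified.

```python
from typing import Callable, List


def handle_log_dir_failure(
    mark_partitions_offline: Callable[[], None],
    propagate_log_dir_event: Callable[[], None],
    close_file_handles: Callable[[], None],
    notify_early: bool,
) -> None:
    """Hypothetical model of the failure handler, not Kafka's actual code."""
    mark_partitions_offline()
    if notify_early:
        # Proposed order: tell the controller before the slow close.
        propagate_log_dir_event()
        close_file_handles()
    else:
        # Order described in the issue: the close delays the notification.
        close_file_handles()
        propagate_log_dir_event()


def controller_notification_delay(notify_early: bool, close_cost_s: float) -> float:
    """Return the simulated time (seconds) at which the controller is notified."""
    clock = [0.0]  # simulated clock, advanced only by the handle-closing step
    notified_at: List[float] = []

    def mark_offline() -> None:
        pass  # modeled as instantaneous

    def propagate() -> None:
        notified_at.append(clock[0])  # record when the event reaches the controller

    def close_handles() -> None:
        clock[0] += close_cost_s  # simulated cost of closing handles on a bad disk

    handle_log_dir_failure(mark_offline, propagate, close_handles, notify_early)
    return notified_at[0]
```

With a 20-second close (roughly the duration seen in the incident described in the issue), controller_notification_delay(False, 20.0) returns 20.0 while controller_notification_delay(True, 20.0) returns 0.0. In a real broker the notification would still wait for the partitions to be marked offline and for the ZooKeeper write itself.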

> The propagation of log dir failure can be delayed due to slowness in closing 
> the file handles
> ---------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-7836
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7836
>             Project: Kafka
>          Issue Type: Improvement
>            Reporter: Jun Rao
>            Priority: Major
>
> In ReplicaManager.handleLogDirFailure(), we call
> zkClient.propagateLogDirEvent after logManager.handleLogDirFailure. The
> latter closes the file handles of the offline replicas, which can take a long
> time when the disk is bad. This delays the election of new leaders by the
> controller. In one incident, we saw the closing of the file handles of
> multiple replicas take more than 20 seconds.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
