Jun Rao created KAFKA-7836:
------------------------------

             Summary: The propagation of log dir failure can be delayed due to slowness in closing the file handles
                 Key: KAFKA-7836
                 URL: https://issues.apache.org/jira/browse/KAFKA-7836
             Project: Kafka
          Issue Type: Improvement
            Reporter: Jun Rao

In ReplicaManager.handleLogDirFailure(), we call zkClient.propagateLogDirEvent after logManager.handleLogDirFailure. The latter closes the file handles of the offline replicas, which can take a long time when the disk is bad. Because the log dir event is propagated only after that call returns, the new leader election by the controller is delayed. In one incident, closing the file handles of multiple replicas took more than 20 seconds.
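
To make the ordering concern concrete, here is a minimal Java sketch. It is a toy model, not Kafka code: the class LogDirFailureOrdering and the methods handleLogDirFailure, closeHandles and propagate are hypothetical stand-ins for the steps named above, and the sleep only simulates a slow disk. It records the order in which the two steps complete, so the effect of notifying before or after the slow close is visible. Whether reordering the real calls is safe in Kafka is not something this sketch establishes.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model only; none of these names are Kafka APIs.
class LogDirFailureOrdering {

    // Runs the two steps in the chosen order and returns the order
    // in which they completed.
    static List<String> handleLogDirFailure(boolean propagateFirst, long closeDelayMs) {
        List<String> completed = new ArrayList<>();
        if (propagateFirst) {
            propagate(completed);                  // notification is not held up by the slow close
            closeHandles(completed, closeDelayMs);
        } else {
            closeHandles(completed, closeDelayMs); // order described in the issue
            propagate(completed);                  // delayed by however long the close took
        }
        return completed;
    }

    // Stand-in for closing the file handles of offline replicas on a bad disk.
    static void closeHandles(List<String> completed, long delayMs) {
        try {
            Thread.sleep(delayMs); // simulated slowness, an assumption of this sketch
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        completed.add("close");
    }

    // Stand-in for propagating the log dir failure event.
    static void propagate(List<String> completed) {
        completed.add("propagate");
    }

    public static void main(String[] args) {
        System.out.println(handleLogDirFailure(false, 50)); // prints [close, propagate]
        System.out.println(handleLogDirFailure(true, 50));  // prints [propagate, close]
    }
}
```

In the first call the notification waits for the full simulated close delay; in the second it is recorded before the close starts, which is the kind of reordering the issue's analysis points toward.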