hudeqi created KAFKA-14824:
------------------------------

             Summary: ReplicaAlterLogDirsThread may cause serious disk usage in case of unknown exception
                 Key: KAFKA-14824
                 URL: https://issues.apache.org/jira/browse/KAFKA-14824
             Project: Kafka
          Issue Type: Bug
          Components: core
    Affects Versions: 3.3.2
            Reporter: hudeqi

In ReplicaAlterLogDirsThread, if a partition is marked as failed because of an unknown exception and fetching for that partition is suspended, the pause on log cleanup for the partition needs to be canceled as well. Otherwise, disk usage grows unexpectedly and without bound.

For example, we hit the following case in a production environment (running Kafka 2.5.1) while balancing log dirs on the broker that leads the partition. Once the future log had been created successfully, the thread started fetching, and on the first fetch it hit an out-of-range error, so it reset and truncated to the leader's log start offset. At the same time, the partition leader was processing a leaderAndIsrRequest and the leader epoch was updated, so ReplicaAlterLogDirsThread received FENCED_LEADER_EPOCH and the 'partitionStates' entry for the partition was cleaned up. Meanwhile, the thread processing the leaderAndIsrRequest was executing the logic that adds the partition to ReplicaAlterLogDirsThread again. There, the offset set in InitialFetchState is the leader's high watermark (hw). When ReplicaAlterLogDirsThread then ran processFetchRequest, it threw "java.lang.IllegalStateException : Offset mismatch for the future replica anti_fraud.data_collector.anticrawler_live-54: fetched offset = 4979659327, log end offset = 4918576434."

As a result, ReplicaAlterLogDirsThread no longer fetches the partition, and because cleanup of the partition was paused earlier and never resumed, disk usage on the corresponding broker grows without bound, causing serious problems.

I found that trunk already fixes, in KAFKA-9087, the bug that could cause ReplicaAlterLogDirsThread to hit the "Offset mismatch" error and stop fetching. However, I don't know whether other unknown exceptions could still occur and, given the current logic, lead to the same failure to clean up the disk.
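
The failure mode described above can be illustrated with a small toy model. This is only a sketch and not Kafka code: `ToyCleaner`, `ToyFetcher`, `startMove` and the `resumeCleanupOnFailure` flag are hypothetical names invented for illustration, and pausing cleanup inside `startMove` is a simplifying assumption about where the pause happens. It shows that when a partition is marked as failed and nothing resumes cleanup, the partition stays paused, and that resuming cleanup in the failure path (the change suggested in this report) avoids that.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model only. None of these classes or methods are Kafka APIs; all names are hypothetical.
final class AlterLogDirsCleanupSketch {

    /** Tracks which partitions currently have log cleanup paused. */
    static final class ToyCleaner {
        private final Set<String> paused = new HashSet<>();

        void pauseCleaning(String partition) { paused.add(partition); }
        void resumeCleaning(String partition) { paused.remove(partition); }
        boolean isPaused(String partition) { return paused.contains(partition); }
    }

    /** Stand-in for the thread that copies a partition into its future log. */
    static final class ToyFetcher {
        private final ToyCleaner cleaner;
        private final boolean resumeCleanupOnFailure;
        private final Set<String> fetching = new HashSet<>();
        private final Set<String> failed = new HashSet<>();

        ToyFetcher(ToyCleaner cleaner, boolean resumeCleanupOnFailure) {
            this.cleaner = cleaner;
            this.resumeCleanupOnFailure = resumeCleanupOnFailure;
        }

        // Assumption: cleanup is paused at the moment the log dir move starts.
        void startMove(String partition) {
            cleaner.pauseCleaning(partition);
            fetching.add(partition);
        }

        void processFetch(String partition, long fetchedOffset, long logEndOffset) {
            if (!fetching.contains(partition)) {
                return; // failed or unknown partitions are not fetched again
            }
            try {
                if (fetchedOffset != logEndOffset) {
                    throw new IllegalStateException("Offset mismatch: fetched offset = "
                            + fetchedOffset + ", log end offset = " + logEndOffset);
                }
                // A real fetcher would append the fetched records here.
            } catch (RuntimeException e) {
                // Any unexpected exception marks the partition as failed.
                fetching.remove(partition);
                failed.add(partition);
                if (resumeCleanupOnFailure) {
                    cleaner.resumeCleaning(partition); // the change suggested in this report
                }
            }
        }

        boolean isFailed(String partition) { return failed.contains(partition); }
    }

    /** Runs the scenario once and reports whether cleanup is still paused afterwards. */
    static boolean cleanupStillPausedAfterFailure(boolean resumeCleanupOnFailure) {
        String partition = "anti_fraud.data_collector.anticrawler_live-54";
        ToyCleaner cleaner = new ToyCleaner();
        ToyFetcher fetcher = new ToyFetcher(cleaner, resumeCleanupOnFailure);
        fetcher.startMove(partition);
        // Offsets taken from the exception message in this report.
        fetcher.processFetch(partition, 4979659327L, 4918576434L);
        return cleaner.isPaused(partition);
    }

    public static void main(String[] args) {
        System.out.println("without resume, cleanup still paused: "
                + cleanupStillPausedAfterFailure(false));
        System.out.println("with resume, cleanup still paused: "
                + cleanupStillPausedAfterFailure(true));
    }
}
```

In this model the catch block handles any RuntimeException, not only the offset mismatch, which mirrors the concern that other unknown exceptions could leave cleanup paused in the same way. Whether resuming cleanup in the failure path is safe in the real broker would need to be checked against the actual code.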