[ https://issues.apache.org/jira/browse/KAFKA-14824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

hudeqi updated KAFKA-14824:
---------------------------
    Summary: ReplicaAlterLogDirsThread may cause serious disk growing in case of unknown exception  (was: ReplicaAlterLogDirsThread may cause serious disk usage in case of unknown exception)

> ReplicaAlterLogDirsThread may cause serious disk growing in case of unknown exception
> -------------------------------------------------------------------------------------
>
>                 Key: KAFKA-14824
>                 URL: https://issues.apache.org/jira/browse/KAFKA-14824
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 3.3.2
>            Reporter: hudeqi
>            Priority: Blocker
>         Attachments: 1.png, 2.png, 3.png, 4.png
>
>
> If ReplicaAlterLogDirsThread marks a partition as failed because of an
> unknown exception and stops fetching it, the cleanup pause on that partition
> also needs to be lifted. Otherwise the result is serious, unexpected disk
> usage growth.
>
> For example, we hit the following case in production (Kafka 2.5.1) while
> balancing log dirs on the broker that leads the partition. Once the future
> log was created, ReplicaAlterLogDirsThread started fetching, and the first
> fetch was out of range, so it reset and truncated to the leader's log start
> offset. Meanwhile the partition leader was processing a leaderAndIsrRequest
> that updated the leader epoch, so ReplicaAlterLogDirsThread got
> FENCED_LEADER_EPOCH and the partition was removed from 'partitionStates'. At
> the same time, the thread processing the leaderAndIsrRequest was adding the
> partition back to ReplicaAlterLogDirsThread, with the offset in
> InitialFetchState set to the leader's high watermark (hw).
> When ReplicaAlterLogDirsThread then runs processFetchRequest, it throws
> "java.lang.IllegalStateException: Offset mismatch for the future replica
> anti_fraud.data_collector.anticrawler_live-54: fetched offset = 4979659327,
> log end offset = 4918576434." As a result, ReplicaAlterLogDirsThread stops
> fetching the partition while cleanup for the partition is still paused from
> earlier, so disk usage on the corresponding broker keeps growing without
> bound, causing serious problems.
>
> I found that trunk already fixes this particular bug in KAFKA-9087, that is,
> the "Offset mismatch" error that makes ReplicaAlterLogDirsThread stop
> fetching. But I don't know whether some other unknown exception could, under
> the current logic, cause the same disk cleanup failure.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
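
The failure mode described above can be illustrated with a small, self-contained toy model. This is only a sketch under stated assumptions: CleanerModel, AlterLogDirsModel, startMove, processFetch and markFailed are hypothetical names invented for illustration and are not Kafka's classes or methods. It assumes cleanup is paused for the duration of the move and shows that, unless the failure path resumes cleanup, the partition stays paused after fetching has stopped.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model only: every class and method name here is hypothetical,
// not Kafka's real API.
public class FutureReplicaCleanupSketch {

    // Stand-in for whatever component pauses and resumes log cleanup.
    static class CleanerModel {
        private final Set<String> paused = new HashSet<>();
        void pause(String partition) { paused.add(partition); }
        void resume(String partition) { paused.remove(partition); }
        boolean isPaused(String partition) { return paused.contains(partition); }
    }

    // Stand-in for the thread that copies a partition into its future log.
    static class AlterLogDirsModel {
        private final CleanerModel cleaner;
        private final boolean resumeCleanupOnFailure;
        private final Map<String, Long> futureLogEndOffsets = new HashMap<>();
        private final Set<String> failedPartitions = new HashSet<>();

        AlterLogDirsModel(CleanerModel cleaner, boolean resumeCleanupOnFailure) {
            this.cleaner = cleaner;
            this.resumeCleanupOnFailure = resumeCleanupOnFailure;
        }

        // Assumption: cleanup is paused for as long as the move is in progress.
        void startMove(String partition, long futureLogEndOffset) {
            cleaner.pause(partition);
            futureLogEndOffsets.put(partition, futureLogEndOffset);
        }

        void processFetch(String partition, long fetchedOffset) {
            Long logEndOffset = futureLogEndOffsets.get(partition);
            if (logEndOffset == null) {
                return; // partition is not (or no longer) being fetched
            }
            try {
                if (fetchedOffset != logEndOffset.longValue()) {
                    throw new IllegalStateException("Offset mismatch for the future replica "
                            + partition + ": fetched offset = " + fetchedOffset
                            + ", log end offset = " + logEndOffset);
                }
                // A real implementation would append the fetched records here.
            } catch (RuntimeException e) {
                markFailed(partition); // any unexpected exception ends up here
            }
        }

        private void markFailed(String partition) {
            failedPartitions.add(partition);
            futureLogEndOffsets.remove(partition); // fetching stops from now on
            if (resumeCleanupOnFailure) {
                cleaner.resume(partition); // the behavior this report asks for
            }
        }

        boolean isFailed(String partition) { return failedPartitions.contains(partition); }
    }

    // Replays the offsets from the report and returns whether cleanup is still paused.
    static boolean cleanupPausedAfterFailure(boolean resumeCleanupOnFailure) {
        String partition = "anti_fraud.data_collector.anticrawler_live-54";
        CleanerModel cleaner = new CleanerModel();
        AlterLogDirsModel thread = new AlterLogDirsModel(cleaner, resumeCleanupOnFailure);
        thread.startMove(partition, 4918576434L);    // future log end offset
        thread.processFetch(partition, 4979659327L); // fetch offset taken from the leader's hw
        return cleaner.isPaused(partition);
    }

    public static void main(String[] args) {
        System.out.println("no resume on failure:   paused = " + cleanupPausedAfterFailure(false));
        System.out.println("with resume on failure: paused = " + cleanupPausedAfterFailure(true));
    }
}
```

In this model the first call leaves the partition paused and no longer fetched, which corresponds to the unbounded disk growth described in the report; the second call shows the requested behavior, where marking the partition as failed also lifts the cleanup pause.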