[ https://issues.apache.org/jira/browse/KAFKA-16709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Luke Chen resolved KAFKA-16709.
-------------------------------
    Resolution: Fixed

> alter logDir within broker might cause log cleanup hanging
> ----------------------------------------------------------
>
>                 Key: KAFKA-16709
>                 URL: https://issues.apache.org/jira/browse/KAFKA-16709
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 3.7.0
>            Reporter: Luke Chen
>            Assignee: Luke Chen
>            Priority: Major
>             Fix For: 3.8.0
>
>
> When altering replica logDirs, we create a future log and pause log
> cleaning for the partition
> ([here|https://github.com/apache/kafka/blob/643db430a707479c9e87eec1ad67e1d4f43c9268/core/src/main/scala/kafka/server/ReplicaManager.scala#L1200]).
> Log cleaning is resumed after the alter replica logDirs operation completes
> ([here|https://github.com/apache/kafka/blob/643db430a707479c9e87eec1ad67e1d4f43c9268/core/src/main/scala/kafka/log/LogManager.scala#L1254]).
> Resuming log cleaning decrements the LogCleaningPaused count by 1. Cleaning
> actually resumes only once the count reaches 0
> ([here|https://github.com/apache/kafka/blob/643db430a707479c9e87eec1ad67e1d4f43c9268/core/src/main/scala/kafka/log/LogCleanerManager.scala#L310]).
> For more explanation of the logCleaningPaused state, see
> [here|https://github.com/apache/kafka/blob/643db430a707479c9e87eec1ad67e1d4f43c9268/core/src/main/scala/kafka/log/LogCleanerManager.scala#L55].
>
> However, there is one more factor that can increase the LogCleaningPaused
> count: a leadership change
> ([here|https://github.com/apache/kafka/blob/643db430a707479c9e87eec1ad67e1d4f43c9268/core/src/main/scala/kafka/server/ReplicaManager.scala#L2126]).
> When there is a leadership change, we check whether this partition has a
> future log; if so, we create the future log and call pauseCleaning
> (LogCleaningPaused count + 1).
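The counting rule described above (each pause adds 1, each resume subtracts 1, and cleaning only resumes at 0) can be illustrated with a small toy model. This is a sketch for illustration only, not Kafka's LogCleanerManager code: the class `CleaningPauseTracker`, the function `replay`, and the event names `"pause"` and `"resume"` are all hypothetical.

```python
class CleaningPauseTracker:
    """Toy per-partition pause counter (hypothetical, not Kafka's implementation)."""

    def __init__(self):
        # partition name -> number of outstanding pause requests
        self._paused = {}

    def pause(self, partition):
        # Every pause request increments the count, even if already paused.
        self._paused[partition] = self._paused.get(partition, 0) + 1
        return self._paused[partition]

    def resume(self, partition):
        # Every resume request decrements the count; the paused state is
        # only dropped when the count goes back to 0.
        count = self._paused.get(partition, 0)
        if count == 0:
            raise ValueError(f"{partition} is not paused")
        if count == 1:
            del self._paused[partition]
        else:
            self._paused[partition] = count - 1
        return count - 1

    def is_paused(self, partition):
        return self._paused.get(partition, 0) > 0


def replay(events, partition="tp0"):
    """Apply a sequence of 'pause'/'resume' events; report whether cleaning is still paused."""
    tracker = CleaningPauseTracker()
    for event in events:
        if event == "pause":
            tracker.pause(partition)
        elif event == "resume":
            tracker.resume(partition)
        else:
            raise ValueError(f"unknown event: {event}")
    return tracker.is_paused(partition)
```

With balanced calls, `replay(["pause", "resume"])` leaves the partition unpaused. If a second pause arrives before the single resume, as in `replay(["pause", "pause", "resume"])`, the partition stays paused until another resume is issued.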
> So, if the following happens during an alter replica logDirs:
> # alter replica logDirs for tp0 is triggered (LogCleaningPaused count = 1)
> # tp0 leadership changes (LogCleaningPaused count = 2)
> # alter replica logDirs completes and resumes log cleaning
> (LogCleaningPaused count = 1)
> # Log cleaning stays paused because the count never returns to 0
>
> Log cleaning is not only about compacting logs; it also affects normal log
> retention processing, which means log retention for these paused partitions
> stays pending. The issue goes away once the broker is restarted.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)