Luke Chen created KAFKA-16709:
---------------------------------

             Summary: move logDir within broker might cause log cleanup hanging
                 Key: KAFKA-16709
                 URL: https://issues.apache.org/jira/browse/KAFKA-16709
             Project: Kafka
          Issue Type: Bug
    Affects Versions: 3.7.0
            Reporter: Luke Chen
            Assignee: Luke Chen


When altering replica logDirs, we create a future log and pause log cleaning 
for the partition 
([here|https://github.com/apache/kafka/blob/643db430a707479c9e87eec1ad67e1d4f43c9268/core/src/main/scala/kafka/server/ReplicaManager.scala#L1200]).
Log cleaning is resumed after the alter replica logDirs operation completes 
([here|https://github.com/apache/kafka/blob/643db430a707479c9e87eec1ad67e1d4f43c9268/core/src/main/scala/kafka/log/LogManager.scala#L1254]).
When resuming log cleaning, we decrement the LogCleaningPaused count by 1, 
and cleaning actually resumes only once the count reaches 0 
([here|https://github.com/apache/kafka/blob/643db430a707479c9e87eec1ad67e1d4f43c9268/core/src/main/scala/kafka/log/LogCleanerManager.scala#L310]).
For more explanation of the LogCleaningPaused state, see 
[here|https://github.com/apache/kafka/blob/643db430a707479c9e87eec1ad67e1d4f43c9268/core/src/main/scala/kafka/log/LogCleanerManager.scala#L55].

 

However, there is another factor that can increase the LogCleaningPaused 
count: a leadership change 
([here|https://github.com/apache/kafka/blob/643db430a707479c9e87eec1ad67e1d4f43c9268/core/src/main/scala/kafka/server/ReplicaManager.scala#L2126]).
When there is a leadership change, we check whether there is a future log for 
the partition. If so, we create the future log and call pauseCleaning 
(LogCleaningPaused count + 1). So, if the following happens during alter 
replica logDirs:
 # alter replica logDirs for tp0 is triggered (LogCleaningPaused count = 1)
 # tp0 leadership changes (LogCleaningPaused count = 2)
 # alter replica logDirs completes and log cleaning is resumed 
(LogCleaningPaused count = 1)
 # log cleaning stays paused because the count never drops back to 0
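
The sequence above can be replayed with a minimal sketch of a counted pause 
state. This is a simplified stand-in written for illustration, not the actual 
LogCleanerManager code: the class name PauseCountSketch, its map, and its 
method signatures are assumptions, and only the increment-on-pause and 
decrement-on-resume behaviour is taken from the description above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the LogCleaningPaused bookkeeping; not Kafka code.
final class PauseCountSketch {
    // partition name -> number of outstanding pause requests
    private final Map<String, Integer> pausedCount = new HashMap<>();

    // Each pause request increments the count for the partition.
    void pauseCleaning(String tp) {
        pausedCount.merge(tp, 1, Integer::sum);
    }

    // Each resume request decrements the count; the entry is removed at 0.
    void resumeCleaning(String tp) {
        pausedCount.computeIfPresent(tp, (k, v) -> v > 1 ? v - 1 : null);
    }

    int count(String tp) {
        return pausedCount.getOrDefault(tp, 0);
    }

    // Cleaning is considered paused while at least one pause is outstanding.
    boolean isPaused(String tp) {
        return count(tp) > 0;
    }

    public static void main(String[] args) {
        PauseCountSketch manager = new PauseCountSketch();
        manager.pauseCleaning("tp0");   // step 1: alter replica logDirs triggered
        manager.pauseCleaning("tp0");   // step 2: leadership change pauses again
        manager.resumeCleaning("tp0");  // step 3: alter replica logDirs completes
        // step 4: one pause is still outstanding and nothing will resume it
        System.out.println("count=" + manager.count("tp0")
                + " paused=" + manager.isPaused("tp0"));
    }
}
```

In this model the single resume in step 3 only cancels one of the two pauses, 
so the partition stays paused until something issues a second resume.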

 

Log cleaning is not only about compacting logs. It also affects normal log 
retention processing, which means log retention for these paused partitions 
will stay pending. The issue goes away once the broker is restarted.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
