drawxy commented on code in PR #13847: URL: https://github.com/apache/kafka/pull/13847#discussion_r1243863992
########## core/src/main/scala/kafka/server/ReplicaManager.scala: ########## @@ -1808,7 +1809,10 @@ class ReplicaManager(val config: KafkaConfig, // pause cleaning for partitions that are being moved and start ReplicaAlterDirThread to move // replica from source dir to destination dir - logManager.abortAndPauseCleaning(topicPartition) + replicaAlterLogDirsManager.getFetcher(topicPartition) match { Review Comment: Recently, I deployed this change to my cluster and found that this change didn't work in the scenario where an exception was raised. During altering, the partition being altered log dir would be marked as failed due to exceptions and the fetcher of it would be removed **without resuming the cleaning**. After receiving a LeaderAndISR request of that partition, the altering would be restarted by assign a fetcher to that partition, and the cleaning of that partition would be paused one more time due to no assigned fetcher. Therefore, the cleaning will never be resumed (pause twice, but resume once after altering completed). I pushed another commit. Every time assign fetcher to partition, pause the cleaning of that partition. And every time remove the fetcher, resume the cleaning of that partition. To make sure that resuming as many times as pausing. If this change make sense to you, I will add unit test later. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org