drawxy commented on code in PR #13847:
URL: https://github.com/apache/kafka/pull/13847#discussion_r1243863992


##########
core/src/main/scala/kafka/server/ReplicaManager.scala:
##########
@@ -1808,7 +1809,10 @@ class ReplicaManager(val config: KafkaConfig,
 
           // pause cleaning for partitions that are being moved and start 
ReplicaAlterDirThread to move
           // replica from source dir to destination dir
-          logManager.abortAndPauseCleaning(topicPartition)
+          replicaAlterLogDirsManager.getFetcher(topicPartition) match {

Review Comment:
   Recently, I deployed this change to my cluster and found that this change 
didn't work in the scenario where an exception was raised. During altering, the 
partition being altered log dir would be marked as failed due to exceptions and 
the fetcher of it would be removed **without resuming the cleaning**. After 
receiving a LeaderAndISR request of that partition, the altering would be 
restarted by assign a fetcher to that partition, and the cleaning of that 
partition would be paused one more time due to no assigned fetcher. Therefore, 
the cleaning will never be resumed (pause twice, but resume once after altering 
completed).
   
   I pushed another commit. Every time assign fetcher to partition, pause the 
cleaning of that partition. And every time remove the fetcher, resume the 
cleaning of that partition. To make sure that resuming as many times as 
pausing. If this change make sense to you, I will add unit test later.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to