Lucas Wang created KAFKA-6630:
---------------------------------

             Summary: Speed up the processing of StopReplicaResponse events on 
the controller
                 Key: KAFKA-6630
                 URL: https://issues.apache.org/jira/browse/KAFKA-6630
             Project: Kafka
          Issue Type: Improvement
          Components: core
            Reporter: Lucas Wang
            Assignee: Lucas Wang


Problem Statement:
In a large cluster with many partition replicas, successfully deleting a 
topic takes a long time.

Root cause:
Further analysis shows that for a topic with N replicas, the controller 
receives all N StopReplicaResponses from the brokers within a short time. 
However, sequentially handling the N resulting 
TopicDeletionStopReplicaResponseReceived events one by one takes a long time.

Specifically, handling every single TopicDeletionStopReplicaResponseReceived 
event triggers the following call chain:
TopicDeletionStopReplicaResponseReceived.process calls 
TopicDeletionManager.completeReplicaDeletion, which calls 
TopicDeletionManager.resumeDeletions, which calls several inefficient functions.

The inefficient functions called inside TopicDeletionManager.resumeDeletions 
include:
ReplicaStateMachine.areAllReplicasForTopicDeleted
ReplicaStateMachine.isAtLeastOneReplicaInDeletionStartedState
ReplicaStateMachine.replicasInState

Each of the 3 inefficient functions above iterates through all the replicas 
in the cluster and keeps only the replicas belonging to the given topic. In a 
large cluster with many replicas, each call can therefore be quite slow.
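
To illustrate the access pattern described above, here is a minimal Java 
sketch. It is not the Kafka controller code: the Replica class and both 
methods are hypothetical names made up for this example. scanForTopic mirrors 
the "iterate over every replica in the cluster, keep one topic" behaviour; 
indexByTopic is shown only as an assumed point of comparison, since the issue 
itself does not prescribe a particular fix.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ReplicaLookupSketch {
    // Hypothetical stand-in for a partition replica; not a Kafka class.
    public static final class Replica {
        public final String topic;
        public final int partition;
        public final int brokerId;

        public Replica(String topic, int partition, int brokerId) {
            this.topic = topic;
            this.partition = partition;
            this.brokerId = brokerId;
        }
    }

    // Full scan: visits every replica in the cluster to select one topic's
    // replicas, so the cost grows with the cluster-wide replica count.
    public static List<Replica> scanForTopic(List<Replica> allReplicas, String topic) {
        List<Replica> result = new ArrayList<>();
        for (Replica r : allReplicas) {
            if (r.topic.equals(topic)) {
                result.add(r);
            }
        }
        return result;
    }

    // Assumed alternative for comparison: group replicas by topic once, so a
    // later lookup touches only the replicas of the requested topic.
    public static Map<String, List<Replica>> indexByTopic(List<Replica> allReplicas) {
        Map<String, List<Replica>> index = new HashMap<>();
        for (Replica r : allReplicas) {
            index.computeIfAbsent(r.topic, k -> new ArrayList<>()).add(r);
        }
        return index;
    }

    public static void main(String[] args) {
        List<Replica> all = new ArrayList<>();
        all.add(new Replica("a", 0, 1));
        all.add(new Replica("a", 0, 2));
        all.add(new Replica("b", 0, 1));

        System.out.println(scanForTopic(all, "a").size());
        System.out.println(indexByTopic(all).get("a").size());
    }
}
```

Both calls select the same replicas for topic "a"; the difference is only in 
how many entries have to be visited per lookup.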

Total deletion time for a topic becomes long in the single-threaded controller 
processing model:
Since the controller processes the queued 
TopicDeletionStopReplicaResponseReceived events sequentially, if processing 
one event takes time t, processing the events for all N replicas of a topic 
takes N * t in total. Because each event triggers scans over all the replicas 
in the cluster, t itself grows with the size of the cluster.
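
As a rough illustration of the N * t cost, the following Java sketch counts 
replica visits when N events are handled one after another and each event 
performs a fixed number of full scans. The class name, method name, and the 
numbers used in main are made up for this example and are not measurements 
from a real cluster.

```java
public class DeletionCostSketch {
    // Counts replica visits when each of n sequentially handled events runs
    // scansPerEvent full scans over clusterReplicas replicas.
    public static long totalVisits(long n, long scansPerEvent, long clusterReplicas) {
        long visits = 0;
        for (long event = 0; event < n; event++) {
            // Work done for one event: t is proportional to this term.
            visits += scansPerEvent * clusterReplicas;
        }
        return visits;
    }

    public static void main(String[] args) {
        // Illustrative numbers only: a topic with 300 replicas, 3 scans per
        // event, in a cluster holding 100,000 replicas.
        System.out.println(totalVisits(300, 3, 100_000)); // prints 90000000
    }
}
```

With these assumed numbers, deleting a single topic costs 90 million replica 
visits, all performed on the one controller thread.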




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
