yuanlihan opened a new issue #11256:
URL: https://github.com/apache/druid/issues/11256


   ### Motivation
   
   The periodic task of the Coordinator service can be slow in a large 
cluster, taking about 5 minutes to complete one cycle. The task consists of 
several subtasks that run serially. Profiling shows that the segment balancing 
subtask is a bottleneck. The root cause is that the current implementation 
invokes the sampling method too many times. We can reduce the number of 
invocations by increasing the sample size of each invocation.
   
   <img width="1080" alt="image" 
src="https://user-images.githubusercontent.com/44718283/118240481-a5ecff80-b4cd-11eb-91b2-e310e0fa91ac.png">
   
   
   ### Proposed changes
   
   Add a new reservoir sampling method that samples K elements per invocation 
instead of only one.
   Add a default method `pickSegmentsToMove` to the `BalancerStrategy` 
interface that picks K segments to move in a single method invocation.
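
   As an illustration of the idea, here is a minimal sketch of reservoir 
sampling that draws K elements in one pass. It is not Druid's code: the class 
`ReservoirSketch` and the method `sampleK` are hypothetical names, and the 
sketch samples from a generic iterator rather than from the Coordinator's 
segment holders.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

// Hypothetical helper for illustration only; not part of Druid's API.
final class ReservoirSketch {
    private ReservoirSketch() {
    }

    // Returns up to k elements chosen uniformly at random from items,
    // reading the iterator exactly once (classic reservoir sampling).
    static <T> List<T> sampleK(Iterator<T> items, int k, Random random) {
        if (k < 0) {
            throw new IllegalArgumentException("k must be non-negative");
        }
        List<T> reservoir = new ArrayList<>(k);
        int seen = 0;
        while (items.hasNext()) {
            T item = items.next();
            seen++;
            if (reservoir.size() < k) {
                // Fill the reservoir with the first k elements.
                reservoir.add(item);
            } else {
                // Keep the new element with probability k / seen by
                // overwriting a uniformly chosen slot.
                int slot = random.nextInt(seen);
                if (slot < k) {
                    reservoir.set(slot, item);
                }
            }
        }
        return reservoir;
    }
}
```

   Calling `ReservoirSketch.sampleK(segments.iterator(), k, new Random())` 
walks the collection once and returns K candidates, whereas drawing one 
element per call would need K separate passes to collect the same number.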
   
   ### Rationale
   
   The current implementation picks only one segment per invocation, and each 
invocation iterates over all segments. When many segments need to be 
rebalanced or decommissioned, the load balancing calculation becomes very slow. 
Picking K segments per invocation significantly reduces the number of 
iterations and thus speeds up the process.
   
   ### Operational impact
   
   There will be no operational impact.
   
   ### Test plan (optional)
   
   Ensure test coverage and verify the change in a test cluster.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


