yuanlihan opened a new issue #11256: URL: https://github.com/apache/druid/issues/11256
### Motivation

The Coordinator service's periodic task can be slow in a large cluster: one cycle takes about 5 minutes to finish. The periodic task consists of several subtasks that run serially. According to the profiling result, the segment balancing subtask has a performance issue. The root cause is that the current implementation invokes the sampling method too many times. We can reduce the number of invocations by increasing the sample size of each invocation.

<img width="1080" alt="image" src="https://user-images.githubusercontent.com/44718283/118240481-a5ecff80-b4cd-11eb-91b2-e310e0fa91ac.png">

### Proposed changes

Add a new reservoir sampling method that samples K elements per invocation instead of only one. A default method `pickSegmentsToMove` will be added to the `BalancerStrategy` interface to pick K segments to move in a single method invocation.

### Rationale

The current implementation picks only one segment per invocation, and each invocation iterates over all segments. When many segments need to be rebalanced or decommissioned, the load-balancing calculation becomes very slow. Picking K segments per invocation significantly reduces the number of iterations and thus speeds up the process.

### Operational impact

There will be no operational impact.

### Test plan (optional)

Ensure test coverage and test the change in a test cluster.
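
To make the proposed change concrete, here is a minimal sketch of sampling K elements in a single pass using the classic reservoir sampling scheme (Algorithm R). It is not Druid's actual code: the class name `ReservoirSampleSketch`, the method `pickK`, and the use of plain strings in place of segments are all assumptions made for illustration, and the real `pickSegmentsToMove` signature may differ.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

// Hypothetical illustration only; names and types are not taken from Druid.
final class ReservoirSampleSketch {

    // Returns up to k items chosen uniformly at random from a single pass over `items`.
    static <T> List<T> pickK(Iterator<T> items, int k, Random random) {
        if (k < 0) {
            throw new IllegalArgumentException("k must be non-negative");
        }
        List<T> reservoir = new ArrayList<>(k);
        int seen = 0; // number of items consumed so far
        while (items.hasNext()) {
            T item = items.next();
            seen++;
            if (reservoir.size() < k) {
                // Fill the reservoir with the first k items.
                reservoir.add(item);
            } else {
                // Keep the new item with probability k / seen by drawing
                // a slot in [0, seen) and replacing only if it falls in [0, k).
                int slot = random.nextInt(seen);
                if (slot < k) {
                    reservoir.set(slot, item);
                }
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        List<String> segments = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            segments.add("segment-" + i); // stand-ins for real segments
        }
        // One pass yields 10 candidates instead of 10 separate passes.
        System.out.println(pickK(segments.iterator(), 10, new Random()));
    }
}
```

With this shape, picking K candidates costs one iteration over the segments rather than K iterations, which is the saving the Rationale section describes.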