[ https://issues.apache.org/jira/browse/MESOS-9806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16914670#comment-16914670 ]
Meng Zhu commented on MESOS-9806:
---------------------------------

Optimized the allocation loop:

{noformat}
commit ec6b7b34215e821a63cb79e7d52d94ff08c1e110
Author: Meng Zhu <m...@mesosphere.io>
Date:   Thu Aug 22 17:54:25 2019 -0700

    Optimized the allocation loop.

    Master:

    HierarchicalAllocator_WithQuotaParam.LargeAndSmallQuota/2
    Made 3500 allocations in 23.37 secs
    Made 0 allocation in 19.72 secs

    Master + this patch:

    HierarchicalAllocator_WithQuotaParam.LargeAndSmallQuota/2
    Made 3500 allocations in 16.831380548secs
    Made 0 allocation in 15.102885644secs

    Review: https://reviews.apache.org/r/71359
{noformat}

> Address allocator performance regression due to the addition of quota limits.
> -----------------------------------------------------------------------------
>
>                 Key: MESOS-9806
>                 URL: https://issues.apache.org/jira/browse/MESOS-9806
>             Project: Mesos
>          Issue Type: Improvement
>          Components: allocation
>            Reporter: Meng Zhu
>            Assignee: Meng Zhu
>            Priority: Critical
>              Labels: resource-management
>
> In MESOS-9802, we removed the quota role sorter, which was tech debt.
> However, this slows down the allocator. The problem is that in the first
> stage, even though a cluster might have no active roles with non-default
> quota, the allocator now has to sort and go through each and every role
> in the cluster. Benchmark results show that for 1k roles with 2k frameworks,
> the allocator could experience ~50% performance degradation.
> There are a couple of ways to address this issue. For example, we could make
> the sorter aware of quota and add a method, say `sortQuotaRoles`, to return
> all the roles with non-default quota. Alternatively, an even better approach
> would be to deprecate the sorter concept and just have two standalone
> functions, e.g. sortRoles() and sortQuotaRoles(), that take in the role tree
> structure (which does not yet exist in the allocator) and return the sorted
> roles.
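The two standalone functions proposed in the paragraph above could look roughly like the following. This is a minimal sketch, not the Mesos implementation: it assumes a flat list of hypothetical `Role` records in place of the role tree (which the description says does not yet exist), and the `Role` struct, its `share` sort key, and the `hasNonDefaultQuota` flag are all illustrative assumptions. Only the names sortRoles() and sortQuotaRoles() come from the description.

```cpp
#include <algorithm>
#include <iterator>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a node of the role tree; not a Mesos type.
struct Role {
  std::string name;
  double share;             // Illustrative sort key (e.g. a DRF-style share).
  bool hasNonDefaultQuota;  // True if a non-default quota is set on the role.
};

// Orders roles by ascending share, breaking ties by name for determinism.
inline bool byShare(const Role& a, const Role& b) {
  if (a.share != b.share) return a.share < b.share;
  return a.name < b.name;
}

// Sorts a copy of the given roles and returns their names in order.
inline std::vector<std::string> sortedNames(std::vector<Role> roles) {
  std::sort(roles.begin(), roles.end(), byShare);
  std::vector<std::string> result;
  result.reserve(roles.size());
  for (const Role& role : roles) result.push_back(role.name);
  return result;
}

// Returns every role, sorted.
inline std::vector<std::string> sortRoles(const std::vector<Role>& roles) {
  return sortedNames(roles);
}

// Returns only the roles with non-default quota, sorted. When no role has
// quota, the result is empty and a first allocation stage that iterates
// over it has nothing to do.
inline std::vector<std::string> sortQuotaRoles(const std::vector<Role>& roles) {
  std::vector<Role> quotaRoles;
  std::copy_if(roles.begin(), roles.end(), std::back_inserter(quotaRoles),
               [](const Role& role) { return role.hasNonDefaultQuota; });
  return sortedNames(std::move(quotaRoles));
}
```

Note that this sketch still scans all roles once to filter them; it only avoids sorting and iterating over roles without quota. A real implementation would presumably track quota roles separately so that the scan is avoided as well.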
> In addition, when implementing MESOS-8068, we need to do more during the
> allocation cycle. In particular, we need to call shrink many more times than
> before. These all contribute to the performance slowdown. Specifically, for
> the quota-oriented benchmark
> `HierarchicalAllocator_WithQuotaParam.LargeAndSmallQuota/2` we can observe
> a 2-3x slowdown compared to the previous release (1.8.1):
>
> Current master:
> QuotaParam/BENCHMARK_HierarchicalAllocator_WithQuotaParam.LargeAndSmallQuota/2
> Benchmark setup: 3000 agents, 3000 roles, 3000 frameworks, with drf sorter
> Made 3500 allocations in 32.051382735secs
> Made 0 allocation in 27.976022773secs
>
> 1.8.1:
> HierarchicalAllocator_WithQuotaParam.LargeAndSmallQuota/2
> Made 3500 allocations in 13.810811063secs
> Made 0 allocation in 9.885972984secs

--
This message was sent by Atlassian Jira
(v8.3.2#803003)