Dario Rexin created MESOS-4694:
----------------------------------

             Summary: DRFAllocator takes very long to allocate resources with a 
large number of frameworks
                 Key: MESOS-4694
                 URL: https://issues.apache.org/jira/browse/MESOS-4694
             Project: Mesos
          Issue Type: Bug
          Components: allocation
    Affects Versions: 0.26.0, 0.27.0, 0.27.1
            Reporter: Dario Rexin
            Assignee: Dario Rexin


Allocation time grows sharply with the number of connected frameworks, and 
the addition of quota in 0.27 increased it further. Running 
`mesos-tests.sh --benchmark 
--gtest_filter=HierarchicalAllocator_BENCHMARK_Test.DeclineOffers` gives the 
following numbers:

{noformat}
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from HierarchicalAllocator_BENCHMARK_Test
[ RUN      ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers
Using 2000 slaves and 200 frameworks
round 0 allocate took 2.921202secs to make 200 offers
round 1 allocate took 2.85045secs to make 200 offers
round 2 allocate took 2.823768secs to make 200 offers
{noformat}

Increasing the number of frameworks to 2000:

{noformat}
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from HierarchicalAllocator_BENCHMARK_Test
[ RUN      ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers
Using 2000 slaves and 2000 frameworks
round 0 allocate took 28.209454secs to make 2000 offers
round 1 allocate took 28.469419secs to make 2000 offers
round 2 allocate took 28.138086secs to make 2000 offers
{noformat}

I was able to cut this time by more than half. After applying the 
patches:

{noformat}
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from HierarchicalAllocator_BENCHMARK_Test
[ RUN      ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers
Using 2000 slaves and 200 frameworks
round 0 allocate took 1.016226secs to make 2000 offers
round 1 allocate took 1.102729secs to make 2000 offers
round 2 allocate took 1.102624secs to make 2000 offers
{noformat}

And with 2000 frameworks:

{noformat}
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from HierarchicalAllocator_BENCHMARK_Test
[ RUN      ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers
Using 2000 slaves and 2000 frameworks
round 0 allocate took 12.563203secs to make 2000 offers
round 1 allocate took 12.437517secs to make 2000 offers
round 2 allocate took 12.470708secs to make 2000 offers
{noformat}

The patches do three things to improve the performance of the allocator:

1) The total values in the DRFSorter are precalculated per resource type.

2) In the allocate method, when no resources are available to allocate, we 
break out of the innermost loop to avoid looping over a large number of 
frameworks when there is nothing to allocate.

3) When a framework suppresses offers, we remove it from the sorter instead of 
just calling continue in the allocation loop. This greatly improves 
performance in the sorter and avoids looping over frameworks that don't need 
resources.
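
The three changes can be illustrated with a small toy model. This is only a 
sketch under stated assumptions, not the Mesos implementation: the `Sorter` 
class, its methods, and the `allocate` function are hypothetical names, and the 
model offers all free resources on an agent to the framework with the lowest 
dominant share.

```python
from dataclasses import dataclass, field


@dataclass
class Sorter:
    """Toy DRF sorter. All names are hypothetical, not the Mesos API."""
    totals: dict = field(default_factory=dict)     # resource -> cluster total
    allocated: dict = field(default_factory=dict)  # framework -> {resource: amount}
    active: set = field(default_factory=set)       # frameworks that want offers

    def add_total(self, resources):
        # Point 1: maintain per-resource totals incrementally, so computing
        # a share never has to re-sum the resources of every agent.
        for name, amount in resources.items():
            self.totals[name] = self.totals.get(name, 0.0) + amount

    def add(self, framework):
        self.allocated.setdefault(framework, {})
        self.active.add(framework)

    def suppress(self, framework):
        # Point 3: a suppressed framework is taken out of the sorted set,
        # so it is neither sorted nor visited by the allocation loop.
        self.active.discard(framework)

    def record(self, framework, resources):
        held = self.allocated[framework]
        for name, amount in resources.items():
            held[name] = held.get(name, 0.0) + amount

    def share(self, framework):
        # Dominant share: largest fraction of any single resource type held.
        held = self.allocated[framework]
        return max(
            (held.get(r, 0.0) / t for r, t in self.totals.items() if t > 0),
            default=0.0,
        )

    def sort(self):
        # Lowest dominant share first; the name breaks ties deterministically.
        return sorted(self.active, key=lambda f: (self.share(f), f))


def allocate(sorter, agents):
    """Offer each agent's free resources; return (offers, frameworks visited)."""
    offers, visited = [], 0
    for agent_id, free in agents.items():
        for framework in sorter.sort():
            # Point 2: once the agent has nothing left, stop scanning the
            # remaining frameworks instead of iterating over all of them.
            if not any(amount > 0 for amount in free.values()):
                break
            visited += 1
            offers.append((framework, agent_id, dict(free)))
            sorter.record(framework, free)
            for name in free:
                free[name] = 0.0
    return offers, visited


sorter = Sorter()
agents = {f"agent{i}": {"cpus": 4.0, "mem": 1024.0} for i in range(3)}
for free in agents.values():
    sorter.add_total(free)
for i in range(5):
    sorter.add(f"fw{i}")
sorter.suppress("fw0")  # fw0 has nothing to schedule

offers, visited = allocate(sorter, agents)
print([(fw, agent) for fw, agent, _ in offers], visited)
```

With three agents and four active frameworks, the inner loop in this toy model 
visits only one framework per agent, and the suppressed framework is never 
sorted or visited.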

Assuming that most of the frameworks behave nicely and suppress offers when 
they have nothing to schedule, it is fair to assume that point 3) has the 
biggest impact on performance. If we suppress offers for 90% of the 
frameworks in the benchmark test, we see the following numbers:

{noformat}
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from HierarchicalAllocator_BENCHMARK_Test
[ RUN      ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers
Using 200 slaves and 2000 frameworks
round 0 allocate took 11626us to make 200 offers
round 1 allocate took 22890us to make 200 offers
round 2 allocate took 21346us to make 200 offers
{noformat}

And for 200 frameworks:

{noformat}
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from HierarchicalAllocator_BENCHMARK_Test
[ RUN      ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers
Using 2000 slaves and 2000 frameworks
round 0 allocate took 1.11178secs to make 2000 offers
round 1 allocate took 1.062649secs to make 2000 offers
round 2 allocate took 1.080181secs to make 2000 offers
{noformat}

Review requests:

https://reviews.apache.org/r/43665/
https://reviews.apache.org/r/43666/
https://reviews.apache.org/r/43668/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)