fair: Consider asymmetric scheduler groups in load balancer

Tobias Huschle Mon, 05 Jun 2023 01:07:50 -0700

On 2023-05-16 15:36, Vincent Guittot wrote:

On Mon, 15 May 2023 at 13:46, Tobias Huschle <husc...@linux.ibm.com>wrote:

The current load balancer implementation implies that schedulergroups,

within the same domain, all host the same number of CPUs. This is
reflected in the condition, that a scheduler group, which is load
balancing and classified as having spare capacity, should pull work
from the busiest group, if the local group runs less processes than
the busiest one. This implies that these two groups should run the
same number of processes, which is problematic if the groups are not
of the same size.

The assumption that scheduler groups within the same scheduler domain
host the same number of CPUs appears to be true for non-s390
architectures. Nevertheless, s390 can have scheduler groups of unequal
size.

This introduces a performance degredation in the following scenario:

Consider a system with 8 CPUs, 6 CPUs are located on one CPU socket,
the remaining 2 are located on another socket:

Socket   -----1-----    -2-
CPU      1 2 3 4 5 6    7 8

Placing some workload ( x = one task ) yields the following
scenarios:

The first 5 tasks are distributed evenly across the two groups.

Socket   -----1-----    -2-
CPU      1 2 3 4 5 6    7 8
         x x x          x x

Adding a 6th task yields the following distribution:

Socket   -----1-----    -2-
CPU      1 2 3 4 5 6    7 8
SMT1     x x x          x x
SMT2                    x


Your description is a bit confusing for me. What you name CPU above
should be named Core, doesn' it ?

Could you share with us your scheduler topology ?


You are correct, it should say core instead of CPU.

One actual configuration from one of my test machines (lscpu -e):

CPU NODE DRAWER BOOK SOCKET CORE L1d:L1i:L2 ONLINE CONFIGUREDPOLARIZATION ADDRESS0 0 0 0 0 0 0:0:0 yes yes horizontal01 0 0 0 0 0 1:1:0 yes yes horizontal12 0 0 0 0 1 2:2:0 yes yes horizontal23 0 0 0 0 1 3:3:0 yes yes horizontal34 0 0 0 0 2 4:4:0 yes yes horizontal45 0 0 0 0 2 5:5:0 yes yes horizontal56 0 0 0 0 3 6:6:0 yes yes horizontal67 0 0 0 0 3 7:7:0 yes yes horizontal78 0 0 0 0 4 8:8:0 yes yes horizontal89 0 0 0 0 4 9:9:0 yes yes horizontal910 0 0 0 0 5 10:10:0 yes yes horizontal1011 0 0 0 0 5 11:11:0 yes yes horizontal1112 0 0 0 1 6 12:12:0 yes yes horizontal1213 0 0 0 1 6 13:13:0 yes yes horizontal1314 0 0 0 1 7 14:14:0 yes yes horizontal1415 0 0 0 1 7 15:15:0 yes yes horizontal15


So, 6 cores / 12 CPUs in one group 2 cores / 4 CPUs in the other.

If I run stress-ng with 8 cpu stressors on the original code I get adistribution

like this:

00 01 02 03 04 05 06 07 08 09 10 11  || 12 13 14 15
                x     x     x     x      x  x  x  x

Which means that the two cores in the smaller group are running into SMTwhile twocores in the larger group are still idle. This is caused by theprefer_sibling path

which really wants to see both groups run the same number of tasks.


The task is added to the 2nd scheduler group, as the scheduler has the
assumption that scheduler groups are of the same size, so they should
also host the same number of tasks. This makes CPU 7 run into SMT
thread, which comes with a performance penalty. This means, that in
the window of 6-8 tasks, load balancing is done suboptimally, because
SMT is used although there is no reason to do so as fully idle CPUs
are still available.

Taking the weight of the scheduler groups into account, ensures that
a load balancing CPU within a smaller group will not try to pull tasks
from a bigger group while the bigger group still has idle CPUs
available.

Signed-off-by: Tobias Huschle <husc...@linux.ibm.com>
---
 kernel/sched/fair.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 48b6f0ca13ac..b1307d7e4065 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c

@@ -10426,7 +10426,8 @@ static struct sched_group*find_busiest_group(struct lb_env *env)

         * group's child domain.
         */

if (sds.prefer_sibling && local->group_type == group_has_spare&&

-           busiest->sum_nr_running > local->sum_nr_running + 1)
+           busiest->sum_nr_running * local->group_weight >

+ local->sum_nr_running * busiest->group_weight+ 1)


This is the prefer_sibling path. Could it be that you should disable
prefer_siling between your sockets for such topology ? the default
path compares the number of idle CPUs when groups has spare capacity

If I disable prefer_sibling (I played around with it a bit), I run intothe problem,that the tasks are distributed s.t. each group has the same amount ofidle CPUs, which

yields distributions similar to this:

00 01 02 03 04 05 06 07 08 09 10 11  || 12 13 14 15
    x  x  x     x  x     x     x  x

Now both groups have 4 idle CPUs which fulfills the criteria imposed bythe load balancer,but the larger group is now running SMT while the smaller one is justidle.

So, in this asymmetric setup, both criteria seem to not work in anoptimal way. Going forthe same number of idle CPUs or alternatively for the same number ofrunning processesboth cause a sub-optimal distribution of tasks, leading to unnecessarySMT.

It seems also to be possible to address the regular load balancing pathby aiming to have thesame unused capacity between groups instead of the same number of idleCPUs. This seems tohave been considered in the past, but the choice went in favor of thenumber of idle CPUs.Since this decision was actively taken already, I focused on theprefer_sibling path.

The question now would be how to address this properly (or if I'mmissing something here).As mentioned in the cover letter, this was the most simplistic and leastinvasive approachI could find, others might be more sophisticated but also have someside-effects.

I have a bit of a hard time leaving this one as-is, as it justintroduces two additionalmultiplications with no effect for most architectures. Maybe anarchitectures specific

inline function that the compiler can optimize away if not needed?

                goto force_balance;

        if (busiest->group_type != group_overloaded) {
--
2.34.1

Re: [RFC 1/1] sched/fair: Consider asymmetric scheduler groups in load balancer

Reply via email to