On Monday, 05 June 2023 at 10:07:16 (+0200), Tobias Huschle wrote:
> On 2023-05-16 15:36, Vincent Guittot wrote:
> > On Mon, 15 May 2023 at 13:46, Tobias Huschle <husc...@linux.ibm.com> wrote:
> > >
> > > The current load balancer implementation implies that scheduler groups,
> > > within the same domain, all host the same number of CPUs. This is
> > > reflected in the condition that a scheduler group, which is load
> > > balancing and classified as having spare capacity, should pull work
> > > from the busiest group if the local group runs fewer processes than
> > > the busiest one. This implies that these two groups should run the
> > > same number of processes, which is problematic if the groups are not
> > > of the same size.
> > >
> > > The assumption that scheduler groups within the same scheduler domain
> > > host the same number of CPUs appears to be true for non-s390
> > > architectures. Nevertheless, s390 can have scheduler groups of unequal
> > > size.
> > >
> > > This introduces a performance degradation in the following scenario:
> > >
> > > Consider a system with 8 CPUs, 6 CPUs are located on one CPU socket,
> > > the remaining 2 are located on another socket:
> > >
> > > Socket   -----1-----    -2-
> > > CPU      1 2 3 4 5 6    7 8
> > >
> > > Placing some workload (x = one task) yields the following
> > > scenarios:
> > >
> > > The first 5 tasks are distributed evenly across the two groups.
> > >
> > > Socket   -----1-----    -2-
> > > CPU      1 2 3 4 5 6    7 8
> > >          x x x          x x
> > >
> > > Adding a 6th task yields the following distribution:
> > >
> > > Socket   -----1-----    -2-
> > > CPU      1 2 3 4 5 6    7 8
> > > SMT1     x x x          x x
> > > SMT2                    x
> >
> > Your description is a bit confusing for me. What you name CPU above
> > should be named Core, doesn't it?
> >
> > Could you share with us your scheduler topology?
> >
>
> You are correct, it should say core instead of CPU.
>
> One actual configuration from one of my test machines (lscpu -e):
>
[...]

>
> So, 6 cores / 12 CPUs in one group, 2 cores / 4 CPUs in the other.

Thanks for the details

>
> If I run stress-ng with 8 cpu stressors on the original code I get a
> distribution like this:
>
> 00 01 02 03 04 05 06 07 08 09 10 11  ||  12 13 14 15
>  x     x     x     x                      x  x  x  x
>
> Which means that the two cores in the smaller group are running into SMT
> while two cores in the larger group are still idle. This is caused by the
> prefer_sibling path, which really wants to see both groups run the same
> number of tasks.

yes and it considers that there are the same number of CPUs per group

> >
> > > The task is added to the 2nd scheduler group, as the scheduler has the
> > > assumption that scheduler groups are of the same size, so they should
> > > also host the same number of tasks. This makes CPU 7 run into SMT
> > > thread, which comes with a performance penalty. This means that in
> > > the window of 6-8 tasks, load balancing is done suboptimally, because
> > > SMT is used although there is no reason to do so as fully idle CPUs
> > > are still available.
> > >
> > > Taking the weight of the scheduler groups into account ensures that
> > > a load balancing CPU within a smaller group will not try to pull tasks
> > > from a bigger group while the bigger group still has idle CPUs
> > > available.
> > >
> > > Signed-off-by: Tobias Huschle <husc...@linux.ibm.com>
> > > ---
> > >  kernel/sched/fair.c | 3 ++-
> > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 48b6f0ca13ac..b1307d7e4065 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -10426,7 +10426,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
> > >          * group's child domain.
> > >          */
> > >         if (sds.prefer_sibling && local->group_type == group_has_spare &&
> > > -           busiest->sum_nr_running > local->sum_nr_running + 1)
> > > +           busiest->sum_nr_running * local->group_weight >
> > > +                   local->sum_nr_running * busiest->group_weight + 1)

what you want to test here is that moving 1 task from busiest to local group
would help to balance the ratio of tasks per cpu:

    (busiest->sum_nr_running - 1) / busiest->group_weight >=
    (local->sum_nr_running + 1) / local->group_weight

which can be developed into:

    busiest->sum_nr_running * local->group_weight >=
        local->sum_nr_running * busiest->group_weight +
        busiest->group_weight + local->group_weight

and you also have to change how we calculate the imbalance, which currently
just provides half of the diff of nr_running, to something like:

    (busiest->sum_nr_running * local->group_weight -
     local->sum_nr_running * busiest->group_weight) /
        (busiest->group_weight + local->group_weight)
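
For illustration only, here is a minimal user-space sketch (plain C, not
kernel code; the struct and function names are invented for the example) of
the check and imbalance calculation described above, applied to the
12-CPU / 4-CPU groups from this thread:

#include <stdbool.h>
#include <stdio.h>

/*
 * Stand-in for the relevant scheduler group statistics; the real kernel
 * code reads these from struct sg_lb_stats in kernel/sched/fair.c.
 */
struct group_stats {
	unsigned int sum_nr_running;	/* tasks currently in the group */
	unsigned int group_weight;	/* number of CPUs in the group  */
};

/*
 * Ratio-based prefer_sibling check: pulling one task from busiest to
 * local should still leave busiest with at least as many tasks per CPU
 * as local, i.e.
 *
 *   (busiest->sum_nr_running - 1) / busiest->group_weight >=
 *   (local->sum_nr_running + 1) / local->group_weight
 *
 * rewritten without divisions.
 */
static bool should_pull(const struct group_stats *busiest,
			const struct group_stats *local)
{
	return busiest->sum_nr_running * local->group_weight >=
	       local->sum_nr_running * busiest->group_weight +
	       busiest->group_weight + local->group_weight;
}

/*
 * Number of tasks to move so that both groups end up with roughly the
 * same number of tasks per CPU (instead of half the nr_running diff).
 */
static unsigned int calc_imbalance(const struct group_stats *busiest,
				   const struct group_stats *local)
{
	return (busiest->sum_nr_running * local->group_weight -
		local->sum_nr_running * busiest->group_weight) /
	       (busiest->group_weight + local->group_weight);
}

int main(void)
{
	/* The stress-ng case from this thread: 4 tasks on the 4-CPU group,
	 * 4 tasks on the 12-CPU group. */
	struct group_stats busiest = { .sum_nr_running = 4, .group_weight = 4 };
	struct group_stats local   = { .sum_nr_running = 4, .group_weight = 12 };

	printf("pull: %d, imbalance: %u\n",
	       should_pull(&busiest, &local),
	       calc_imbalance(&busiest, &local));
	/* Prints "pull: 1, imbalance: 2": moving 2 tasks gives 2/4 and
	 * 6/12 tasks per CPU, i.e. both groups equally loaded. */
	return 0;
}
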
> >
> > This is the prefer_sibling path. Could it be that you should disable
> > prefer_sibling between your sockets for such topology? the default
> > path compares the number of idle CPUs when groups have spare capacity
> >
> >
>
> If I disable prefer_sibling (I played around with it a bit), I run into the
> problem that the tasks are distributed such that each group has the same
> number of idle CPUs, which yields distributions similar to this:
>
> 00 01 02 03 04 05 06 07 08 09 10 11  ||  12 13 14 15
>  x  x  x  x  x  x  x  x
>
> Now both groups have 4 idle CPUs, which fulfills the criteria imposed by the
> load balancer, but the larger group is now running SMT while the smaller one
> is just idle.
>
> So, in this asymmetric setup, both criteria seem to not work in an optimal
> way. Going for the same number of idle CPUs or alternatively for the same
> number of running processes both cause a sub-optimal distribution of tasks,
> leading to unnecessary SMT.

there is the same behavior and assumption here too

>
> It seems also to be possible to address the regular load balancing path by
> aiming to have the same unused capacity between groups instead of the same
> number of idle CPUs. This seems to have been considered in the past, but the
> choice went in favor of the number of idle CPUs.

unused capacity doesn't give the instantaneous state, so a group can be idle
but without unused capacity

> Since this decision was actively taken already, I focused on the
> prefer_sibling path.
>
> The question now would be how to address this properly (or if I'm missing
> something here). As mentioned in the cover letter, this was the most
> simplistic and least invasive approach I could find; others might be more
> sophisticated but also have some side-effects.
>
> I have a bit of a hard time leaving this one as-is, as it just introduces
> two additional multiplications with no effect for most architectures. Maybe
> an architecture-specific inline function that the compiler can optimize away
> if not needed?
>
> > >                 goto force_balance;
> > >
> > >         if (busiest->group_type != group_overloaded) {
> > > --
> > > 2.34.1
> > >
>
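
As an aside, Tobias asks above whether an architecture-specific inline
function that the compiler can optimize away could avoid the two extra
multiplications on architectures whose groups are always the same size. A
minimal sketch of that idea, assuming a hypothetical config symbol and
helper name (neither exists in the kernel today), compiled as a plain
user-space program:

/*
 * Hypothetical sketch only: CONFIG_SCHED_ASYM_GROUP_WEIGHT and
 * sched_asym_group_weight() are invented for this example.  The point is
 * that the helper is a compile-time constant, so on architectures that do
 * not select the option the weighted comparison (and its multiplications)
 * is dead code and gets compiled away.
 *
 * Build with e.g.:  gcc -O2 -DCONFIG_SCHED_ASYM_GROUP_WEIGHT demo.c
 */
#include <stdbool.h>
#include <stdio.h>

struct sg_stats {
	unsigned int sum_nr_running;
	unsigned int group_weight;
};

/* True only for architectures (like s390) that can build scheduler groups
 * of unequal size. */
static inline bool sched_asym_group_weight(void)
{
#ifdef CONFIG_SCHED_ASYM_GROUP_WEIGHT
	return true;
#else
	return false;
#endif
}

/* prefer_sibling pull check: the weighted form from the patch where it
 * matters, the original nr_running comparison everywhere else. */
static bool prefer_sibling_pull(const struct sg_stats *busiest,
				const struct sg_stats *local)
{
	if (sched_asym_group_weight())
		return busiest->sum_nr_running * local->group_weight >
		       local->sum_nr_running * busiest->group_weight + 1;

	return busiest->sum_nr_running > local->sum_nr_running + 1;
}

int main(void)
{
	/* 4 tasks on the 4-CPU group vs. 4 tasks on the 12-CPU group, as in
	 * the stress-ng example from this thread. */
	struct sg_stats busiest = { 4, 4 };
	struct sg_stats local   = { 4, 12 };

	printf("pull: %d\n", prefer_sibling_pull(&busiest, &local));
	return 0;
}
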