> -----Original Message-----
> From: Peter Zijlstra [mailto:pet...@infradead.org]
> Sent: Thursday, February 11, 2021 12:22 AM
> To: Song Bao Hua (Barry Song) <song.bao....@hisilicon.com>
> Cc: valentin.schnei...@arm.com; vincent.guit...@linaro.org; mgor...@suse.de;
> mi...@kernel.org; dietmar.eggem...@arm.com; morten.rasmus...@arm.com;
> linux-kernel@vger.kernel.org; linux...@openeuler.org; xuwei (O)
> <xuw...@huawei.com>; Liguozhu (Kenneth) <liguo...@hisilicon.com>; tiantao (H)
> <tiant...@hisilicon.com>; wanghuiqiang <wanghuiqi...@huawei.com>; Zengtao (B)
> <prime.z...@hisilicon.com>; Jonathan Cameron <jonathan.came...@huawei.com>;
> guodong...@linaro.org; Meelis Roos <mr...@linux.ee>
> Subject: Re: [PATCH v2] sched/topology: fix the issue groups don't span
> domain->span for NUMA diameter > 2
> 
> On Tue, Feb 09, 2021 at 08:58:15PM +0000, Song Bao Hua (Barry Song) wrote:
> 
> > > I've finally had a moment to think about this, would it make sense to
> > > also break up group: node0+1, such that we then end up with 3 groups of
> > > equal size?
> >
> 
> > Since the sched_domain[n-1] of a part of node[m]'s siblings are able
> > to cover the whole span of sched_domain[n] of node[m], there is no
> > necessity to scan over all siblings of node[m], once sched_domain[n]
> > of node[m] has been covered, we can stop making more sched_groups. So
> > the number of sched_groups is small.
> >
> > So historically, the code has never tried to make sched_groups result
> > in equal size. And it permits the overlapping of local group and remote
> > groups.
> 
> Historically groups have (typically) always been the same size though.

This is probably true for other platforms, but unfortunately it has never
been true on my platform :-)

node   0   1   2   3 
  0:  10  12  20  22 
  1:  12  10  22  24 
  2:  20  22  10  12 
  3:  22  24  12  10

In this case, each NUMA node has only two CPUs.
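
To spell out the mapping I am assuming above (a plain C sketch of the
machine, not kernel code; the array names are just for illustration):

/* each node has two CPUs: node0={0,1} node1={2,3} node2={4,5} node3={6,7} */
static const int cpu_to_node[8] = { 0, 0, 1, 1, 2, 2, 3, 3 };

/* node_distance(i, j), same values as the table above */
static const int node_distance[4][4] = {
	{ 10, 12, 20, 22 },
	{ 12, 10, 22, 24 },
	{ 20, 22, 10, 12 },
	{ 22, 24, 12, 10 },
};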

CPU0's domain-3 has no sched_group overflowing domain->span, but its two
groups are unequal in size and overlap: the first group covers CPUs 0-5
(node0-node2), while the second covers CPUs 4-7 (node2-node3):

[    0.802139] CPU0 attaching sched-domain(s):
[    0.802193]  domain-0: span=0-1 level=MC
[    0.802443]   groups: 0:{ span=0 cap=1013 }, 1:{ span=1 cap=979 }
[    0.802693]   domain-1: span=0-3 level=NUMA
[    0.802731]    groups: 0:{ span=0-1 cap=1992 }, 2:{ span=2-3 cap=1943 }
[    0.802811]    domain-2: span=0-5 level=NUMA
[    0.802829]     groups: 0:{ span=0-3 cap=3935 }, 4:{ span=4-7 cap=3937 }
[    0.802881] ERROR: groups don't span domain->span
[    0.803058]     domain-3: span=0-7 level=NUMA
[    0.803080]      groups: 0:{ span=0-5 mask=0-1 cap=5843 }, 6:{ span=4-7 mask=6-7 cap=4077 }
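
Roughly, the check that fires above compares the union of the group spans
with the domain span. A user-space sketch of that comparison (not the
kernel code; the masks are hand-encoded bitmaps of CPUs 0-7):

#include <stdio.h>

int main(void)
{
	/* bit N set => CPU N is in the mask */
	unsigned int domain2_span   = 0x3f;		/* CPUs 0-5 */
	unsigned int domain2_groups = 0x0f | 0xf0;	/* {0-3} U {4-7} = 0-7 */

	unsigned int domain3_span   = 0xff;		/* CPUs 0-7 */
	unsigned int domain3_groups = 0x3f | 0xf0;	/* {0-5} U {4-7} = 0-7 */

	if (domain2_groups != domain2_span)
		printf("domain-2: ERROR: groups don't span domain->span\n");

	if (domain3_groups == domain3_span)
		printf("domain-3: groups cover the span, so no error is printed,\n"
		       "despite the overlapping, unequal groups\n");

	return 0;
}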


> 
> The reason I did ask is because when you get one large and a bunch of
> smaller groups, the load-balancing 'pull' is relatively smaller to the
> large groups.
> 
> That is, IIRC should_we_balance() ensures only 1 CPU out of the group
> continues the load-balancing pass. So if, for example, we have one group
> of 4 CPUs and one group of 2 CPUs, then the group of 2 CPUs will pull
> 1/2 times, while the group of 4 CPUs will pull 1/4 times.
> 
> By making sure all groups are of the same level, and thus of equal size,
> this doesn't happen.
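
Right. To make sure I follow the arithmetic, here is a toy model of that
ratio (my own sketch, assuming the CPU allowed to continue by
should_we_balance() effectively rotates round-robin inside each group; the
real selection logic is of course different):

#include <stdio.h>

int main(void)
{
	/* hypothetical domain with one 4-CPU group and one 2-CPU group */
	int group_size[2] = { 4, 2 };
	int periods = 1000;

	for (int g = 0; g < 2; g++) {
		int pulls = 0;

		for (int p = 0; p < periods; p++)
			if (p % group_size[g] == 0)	/* this CPU's turn */
				pulls++;

		printf("group of %d CPUs: one CPU balances %d/%d periods (~1/%d)\n",
		       group_size[g], pulls, periods, group_size[g]);
	}
	return 0;
}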

As the domain-3 log above shows, even if we give all groups of domain-2
equal size by breaking up both the local group and the remote groups, we
still run into the same unequal-group problem in domain-3. What makes it
trickier is that domain-3 never reports "groups don't span domain->span".

So it seems we would need to change both domain-2 and domain-3, even though
domain-3 raises no "groups don't span domain->span" error.

Thanks
Barry
