> -----Original Message-----
> From: Valentin Schneider [mailto:valentin.schnei...@arm.com]
> Sent: Friday, February 12, 2021 8:55 AM
> To: Peter Zijlstra <pet...@infradead.org>; Song Bao Hua (Barry Song)
> <song.bao....@hisilicon.com>
> Cc: vincent.guit...@linaro.org; mgor...@suse.de; mi...@kernel.org;
> dietmar.eggem...@arm.com; morten.rasmus...@arm.com;
> linux-kernel@vger.kernel.org; linux...@openeuler.org; xuwei (O)
> <xuw...@huawei.com>; Liguozhu (Kenneth) <liguo...@hisilicon.com>; tiantao (H)
> <tiant...@hisilicon.com>; wanghuiqiang <wanghuiqi...@huawei.com>; Zengtao (B)
> <prime.z...@hisilicon.com>; Jonathan Cameron <jonathan.came...@huawei.com>;
> guodong...@linaro.org; Meelis Roos <mr...@linux.ee>
> Subject: [Linuxarm] Re: [PATCH v2] sched/topology: fix the issue groups don't
> span domain->span for NUMA diameter > 2
> 
> On 10/02/21 12:21, Peter Zijlstra wrote:
> > On Tue, Feb 09, 2021 at 08:58:15PM +0000, Song Bao Hua (Barry Song) wrote:
> >> So historically, the code has never tried to make sched_groups result
> >> in equal size. And it permits the overlapping of local group and remote
> >> groups.
> >
> > Historically groups have (typically) always been the same size though.
> >
> > The reason I did ask is that when you get one large group and a bunch of
> > smaller groups, the load-balancing 'pull' is relatively smaller for the
> > large group.
> >
> > That is, IIRC should_we_balance() ensures only 1 CPU out of the group
> > continues the load-balancing pass. So if, for example, we have one group
> > of 4 CPUs and one group of 2 CPUs, then the group of 2 CPUs will pull
> > 1/2 times, while the group of 4 CPUs will pull 1/4 times.
> >
> > By making sure all groups are of the same level, and thus of equal size,
> > this doesn't happen.
> 
> So I hacked something that tries to do this, with the notable exception
> that it doesn't change the way the local group is generated. Breaking the
> assumption that the local group always spans the child domain doesn't sound
> like the best thing to do.
> 
> Anywho, the below makes it so all !local NUMA groups are built using the
> same sched_domain_topology_level. Some of it is absolutely disgusting
> (esp. the sched_domains_curr_level stuff), I didn't bother with handling
> domain degeneration yet, and I trigger the WARN_ON in get_group()... But at
> least the topology gets built!
> 
> AFAICT Barry's topology is handled the same way. On that Sun Fire topology, it
> pretty much turns all remote groups into groups spanning a single
> node. That does almost double the number of groups for some domains,
> compared to Barry's current v3.
> 
> If that is really a route we want to go down, I'll try to clean the below.
> 
Hi Valentin,

I understand Peter's concern to be that the local group has a different
size from the remote groups. Does this patch resolve that concern?
To me, it seems not :-)

That said, I don't quite understand why different group sizes would be
harmful, since each group calculates avg_load and group_type based on
its own capacity; a smaller group simply has a smaller capacity.
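
To illustrate what I mean, here is a stand-alone sketch of that
per-group accounting -- not the real update_sg_lb_stats(); the toy_*
struct and function names below are made up for illustration:

#include <stdio.h>

#define SCHED_CAPACITY_SCALE 1024

/* Toy group stats, loosely mirroring the per-group load accounting. */
struct toy_group {
	unsigned long group_load;	/* sum of the CPUs' runnable load */
	unsigned long group_capacity;	/* sum of the CPUs' capacities    */
};

/*
 * avg_load is load normalised by the group's own capacity, so a group
 * is not reported as busier merely because it has fewer CPUs.
 */
static unsigned long toy_avg_load(const struct toy_group *g)
{
	return g->group_load * SCHED_CAPACITY_SCALE / g->group_capacity;
}

int main(void)
{
	struct toy_group big   = { .group_load = 4096, .group_capacity = 4 * 1024 };
	struct toy_group small = { .group_load = 2048, .group_capacity = 2 * 1024 };

	/* Both print 1024: equally loaded relative to their own capacity. */
	printf("big: %lu, small: %lu\n", toy_avg_load(&big), toy_avg_load(&small));
	return 0;
}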

Is it because a bigger group has relatively less chance to pull, so
load balancing completes more slowly when the smaller groups are
heavily loaded?
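
If that is the concern, here is my toy reading of that arithmetic as a
stand-alone sketch (not the real should_we_balance(); the toy_* names
are invented): only one designated CPU per group continues the pass at
this domain, so a group's members pull on 1/size of their attempts.

#include <stdio.h>

struct toy_group { int first_cpu; int nr_cpus; };

/* Toy stand-in: only the group's first CPU is allowed to continue. */
static int toy_should_we_balance(const struct toy_group *g, int cpu)
{
	return cpu == g->first_cpu;
}

int main(void)
{
	struct toy_group big   = { .first_cpu = 0, .nr_cpus = 4 };	/* CPUs 0-3 */
	struct toy_group small = { .first_cpu = 4, .nr_cpus = 2 };	/* CPUs 4-5 */
	int cpu, big_pulls = 0, small_pulls = 0;

	/* One balance attempt per CPU at this domain. */
	for (cpu = 0; cpu < 6; cpu++) {
		if (cpu < big.nr_cpus)
			big_pulls += toy_should_we_balance(&big, cpu);
		else
			small_pulls += toy_should_we_balance(&small, cpu);
	}

	/* Prints 1 out of 4 vs 1 out of 2, matching Peter's example. */
	printf("big group:   %d pull out of %d attempts\n", big_pulls, big.nr_cpus);
	printf("small group: %d pull out of %d attempts\n", small_pulls, small.nr_cpus);
	return 0;
}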

> (deposit your drinks before going any further)
> --->8---
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 8f0f778b7c91..e74f48bdd226 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -187,7 +187,10 @@ struct sched_domain_topology_level {
>       sched_domain_mask_f mask;
>       sched_domain_flags_f sd_flags;
>       int                 flags;
> +#ifdef CONFIG_NUMA
>       int                 numa_level;
> +     int                 numa_sibling_level;
> +#endif
>       struct sd_data      data;
>  #ifdef CONFIG_SCHED_DEBUG
>       char                *name;
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 3c50cc7285c9..5a9e6a4d5d89 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -742,6 +742,34 @@ enum s_alloc {
>       sa_none,
>  };
> 
> +/*
> + * Topology list, bottom-up.
> + */
> +static struct sched_domain_topology_level default_topology[] = {
> +#ifdef CONFIG_SCHED_SMT
> +     { cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
> +#endif
> +#ifdef CONFIG_SCHED_MC
> +     { cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
> +#endif
> +     { cpu_cpu_mask, SD_INIT_NAME(DIE) },
> +     { NULL, },
> +};
> +
> +static struct sched_domain_topology_level *sched_domain_topology =
> +     default_topology;
> +
> +#define for_each_sd_topology(tl)                     \
> +     for (tl = sched_domain_topology; tl->mask; tl++)
> +
> +void set_sched_topology(struct sched_domain_topology_level *tl)
> +{
> +     if (WARN_ON_ONCE(sched_smp_initialized))
> +             return;
> +
> +     sched_domain_topology = tl;
> +}
> +
>  /*
>   * Return the canonical balance CPU for this group, this is the first CPU
>   * of this group that's also in the balance mask.
> @@ -955,10 +983,12 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
>       struct sched_group *first = NULL, *last = NULL, *sg;
>       const struct cpumask *span = sched_domain_span(sd);
>       struct cpumask *covered = sched_domains_tmpmask;
> +     struct sched_domain_topology_level *tl;
>       struct sd_data *sdd = sd->private;
>       struct sched_domain *sibling;
>       int i;
> 
> +     tl = container_of(sd->private, struct sched_domain_topology_level, data);
>       cpumask_clear(covered);
> 
>       for_each_cpu_wrap(i, span, cpu) {
> @@ -982,6 +1012,10 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
>               if (!cpumask_test_cpu(i, sched_domain_span(sibling)))
>                       continue;
> 
> +             /* Don't mess up local group being child sched domain */
> +             if (tl->numa_sibling_level != sd->level && i != cpu)
> +                     sibling = *per_cpu_ptr(sched_domain_topology[tl->numa_sibling_level].data.sd, i);
> +
>               sg = build_group_from_child_sched_domain(sibling, cpu);
>               if (!sg)
>                       goto fail;
> @@ -989,7 +1023,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
>               sg_span = sched_group_span(sg);
>               cpumask_or(covered, covered, sg_span);
> 
> -             init_overlap_sched_group(sd, sg);
> +             init_overlap_sched_group(sibling, sg);
> 
>               if (!first)
>                       first = sg;
> @@ -1440,34 +1474,6 @@ sd_init(struct sched_domain_topology_level *tl,
>       return sd;
>  }
> 
> -/*
> - * Topology list, bottom-up.
> - */
> -static struct sched_domain_topology_level default_topology[] = {
> -#ifdef CONFIG_SCHED_SMT
> -     { cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
> -#endif
> -#ifdef CONFIG_SCHED_MC
> -     { cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
> -#endif
> -     { cpu_cpu_mask, SD_INIT_NAME(DIE) },
> -     { NULL, },
> -};
> -
> -static struct sched_domain_topology_level *sched_domain_topology =
> -     default_topology;
> -
> -#define for_each_sd_topology(tl)                     \
> -     for (tl = sched_domain_topology; tl->mask; tl++)
> -
> -void set_sched_topology(struct sched_domain_topology_level *tl)
> -{
> -     if (WARN_ON_ONCE(sched_smp_initialized))
> -             return;
> -
> -     sched_domain_topology = tl;
> -}
> -
>  #ifdef CONFIG_NUMA
> 
>  static const struct cpumask *sd_numa_mask(int cpu)
> @@ -1566,6 +1572,52 @@ static void init_numa_topology_type(void)
> 
>  #define NR_DISTANCE_VALUES (1 << DISTANCE_BITS)
> 
> +void sched_init_numa_siblings(void)
> +{
> +     struct sched_domain_topology_level *tl;
> +     int tl_id = 0, sibling_tl_id = 0;
> +     const struct cpumask *mask;
> +     int last_written_tl = 0;
> +     int n, i, j;
> +
> +     for_each_sd_topology(tl) {
> +             if (!tl->numa_level)
> +                     goto next;
> +
> +             for_each_node(n) {
> +                     /* XXX: this *thing* needs to die */
> +                     sched_domains_curr_level = tl->numa_level;
> +                     i = cpumask_first(cpumask_of_node(n));
> +                     mask = tl->mask(i);
> +
> +                     for_each_cpu_wrap(j, mask, i) {
> +                             sibling_tl_id = tl_id;
> +                             sched_domains_curr_level = tl->numa_level;
> +
> +                             /* Not subset? Down we go! */
> +                             /* XXX bother about degeneration here? */
> +                             do {
> +                                     sibling_tl_id--;
> +                                     sched_domains_curr_level--;
> +                             } while (sibling_tl_id > 0 &&
> +                                      !cpumask_subset(sched_domain_topology[sibling_tl_id].mask(j), mask));
> +
> +                             /* Only write if first write or it lowers level */
> +                             if (last_written_tl < tl_id ||
> +                                 tl->numa_sibling_level > sibling_tl_id) {
> +                                     tl->numa_sibling_level = sibling_tl_id;
> +                                     last_written_tl = tl_id;
> +                             }
> +                     }
> +             }
> +next:
> +             if (last_written_tl < tl_id)
> +                     tl->numa_sibling_level = tl_id;
> +
> +             tl_id++;
> +     }
> +}
> +
>  void sched_init_numa(void)
>  {
>       struct sched_domain_topology_level *tl;
> @@ -1708,6 +1760,8 @@ void sched_init_numa(void)
>       sched_max_numa_distance = sched_domains_numa_distance[nr_levels - 1];
> 
>       init_numa_topology_type();
> +
> +     sched_init_numa_siblings();
>  }
> 
>  void sched_domains_numa_masks_set(unsigned int cpu)
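
To check my reading of sched_init_numa_siblings(), I put together the
toy model below: a made-up 4-node line topology 0-1-2-3 (not my real
machine), with node bitmasks standing in for cpumasks and only the
NUMA levels modelled; tl_mask() and subset() are invented for the
illustration.

#include <stdio.h>

/*
 * Level 0 = the node itself, level 1 = nodes within 1 hop, level 2 =
 * nodes within 2 hops, on a line topology 0-1-2-3.
 */
static unsigned int tl_mask(int level, int node)
{
	unsigned int mask = 1u << node;
	int hops;

	for (hops = 1; hops <= level; hops++) {
		if (node - hops >= 0)
			mask |= 1u << (node - hops);
		if (node + hops <= 3)
			mask |= 1u << (node + hops);
	}
	return mask;
}

static int subset(unsigned int a, unsigned int b)
{
	return (a & ~b) == 0;
}

int main(void)
{
	unsigned int span = tl_mask(2, 0);	/* level-2 span of node 0: {0,1,2} */
	int j, sibling;

	for (j = 0; j < 4; j++) {
		if (!(span & (1u << j)))
			continue;

		/* Same walk-down as in the patch: drop levels until the
		 * span at node j fits inside the domain span. */
		sibling = 2;
		do {
			sibling--;
		} while (sibling > 0 && !subset(tl_mask(sibling, j), span));

		printf("node %d -> sibling level %d, span 0x%x\n",
		       j, sibling, tl_mask(sibling, j));
	}
	return 0;
}

Node 2's level-1 span {1,2,3} is not a subset of {0,1,2}, so it falls
all the way down to level 0, and with the "only write if it lowers the
level" rule the whole tl ends up with numa_sibling_level == 0. That
seems to match your observation that the remote groups end up spanning
a single node.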


Thanks
Barry
