smp: Add support detecting thread-groups sharing L2 cache

Gautham R Shenoy Tue, 08 Dec 2020 09:43:55 -0800

Hello Srikar,

On Mon, Dec 07, 2020 at 06:10:39PM +0530, Srikar Dronamraju wrote:
> * Gautham R. Shenoy <e...@linux.vnet.ibm.com> [2020-12-04 10:18:46]:
> 
> > From: "Gautham R. Shenoy" <e...@linux.vnet.ibm.com>
> > 
> > On POWER systems, groups of threads within a core sharing the L2-cache
> > can be indicated by the "ibm,thread-groups" property array with the
> > identifier "2".
> > 
> > This patch adds support for detecting this, and when present, populate
> > the populating the cpu_l2_cache_mask of every CPU to the core-siblings
> > which share L2 with the CPU as specified in the by the
> > "ibm,thread-groups" property array.
> > 
> > On a platform with the following "ibm,thread-group" configuration
> >              00000001 00000002 00000004 00000000
> >              00000002 00000004 00000006 00000001
> >              00000003 00000005 00000007 00000002
> >              00000002 00000004 00000000 00000002
> >              00000004 00000006 00000001 00000003
> >              00000005 00000007
> > 
> > Without this patch, the sched-domain hierarchy for CPUs 0,1 would be
> >     CPU0 attaching sched-domain(s):
> >     domain-0: span=0,2,4,6 level=SMT
> >     domain-1: span=0-7 level=CACHE
> >     domain-2: span=0-15,24-39,48-55 level=MC
> >     domain-3: span=0-55 level=DIE
> > 
> >     CPU1 attaching sched-domain(s):
> >     domain-0: span=1,3,5,7 level=SMT
> >     domain-1: span=0-7 level=CACHE
> >     domain-2: span=0-15,24-39,48-55 level=MC
> >     domain-3: span=0-55 level=DIE
> > 
> > The CACHE domain at 0-7 is incorrect since the ibm,thread-groups
> > sub-array
> > [00000002 00000002 00000004
> >  00000000 00000002 00000004 00000006
> >  00000001 00000003 00000005 00000007]
> > indicates that L2 (Property "2") is shared only between the threads of a 
> > single
> > group. There are "2" groups of threads where each group contains "4"
> > threads each. The groups being {0,2,4,6} and {1,3,5,7}.
> > 
> > With this patch, the sched-domain hierarchy for CPUs 0,1 would be
> >             CPU0 attaching sched-domain(s):
> >     domain-0: span=0,2,4,6 level=SMT
> >     domain-1: span=0-15,24-39,48-55 level=MC
> >     domain-2: span=0-55 level=DIE
> > 
> >     CPU1 attaching sched-domain(s):
> >     domain-0: span=1,3,5,7 level=SMT
> >     domain-1: span=0-15,24-39,48-55 level=MC
> >     domain-2: span=0-55 level=DIE
> > 
> > The CACHE domain with span=0,2,4,6 for CPU 0 (span=1,3,5,7 for CPU 1
> > resp.) gets degenerated into the SMT domain. Furthermore, the
> > last-level-cache domain gets correctly set to the SMT sched-domain.
> > 
> > Signed-off-by: Gautham R. Shenoy <e...@linux.vnet.ibm.com>
> > ---
> >  arch/powerpc/kernel/smp.c | 66 
> > +++++++++++++++++++++++++++++++++++++++++------
> >  1 file changed, 58 insertions(+), 8 deletions(-)
> > 
> > diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> > index 6a242a3..a116d2d 100644
> > --- a/arch/powerpc/kernel/smp.c
> > +++ b/arch/powerpc/kernel/smp.c
> > @@ -76,6 +76,7 @@
> >  struct task_struct *secondary_current;
> >  bool has_big_cores;
> >  bool coregroup_enabled;
> > +bool thread_group_shares_l2;
> 
> Either keep this as static in this patch or add its declaration
>


This will be used in Patch 3 in kernel/cacheinfo.c, but not any other
place. Hence I am not making it static here.


> > 
> >  DEFINE_PER_CPU(cpumask_var_t, cpu_sibling_map);
> >  DEFINE_PER_CPU(cpumask_var_t, cpu_smallcore_map);
> > @@ -99,6 +100,7 @@ enum {
> > 
> >  #define MAX_THREAD_LIST_SIZE       8
> >  #define THREAD_GROUP_SHARE_L1   1
> > +#define THREAD_GROUP_SHARE_L2   2
> >  struct thread_groups {
> >     unsigned int property;
> >     unsigned int nr_groups;
> > @@ -107,7 +109,7 @@ struct thread_groups {
> >  };
> > 
> >  /* Maximum number of properties that groups of threads within a core can 
> > share */
> > -#define MAX_THREAD_GROUP_PROPERTIES 1
> > +#define MAX_THREAD_GROUP_PROPERTIES 2
> > 
> >  struct thread_groups_list {
> >     unsigned int nr_properties;
> > @@ -121,6 +123,13 @@ struct thread_groups_list {
> >   */
> >  DEFINE_PER_CPU(cpumask_var_t, cpu_l1_cache_map);
> > 
> > +/*
> > + * On some big-cores system, thread_group_l2_cache_map for each CPU
> > + * corresponds to the set its siblings within the core that share the
> > + * L2-cache.
> > + */
> > +DEFINE_PER_CPU(cpumask_var_t, thread_group_l2_cache_map);
> > +
> 
> NIT:
> We are trying to confuse ourselves with the names.
> For L1 we have cpu_l2_cache_map to store the tasks from the thread group.
> but cpu_smallcore_map for keeping track of tasks.
>

I suppose you mean cpu_l1_cache_map here. We are using
cpu_smallcore_map, because when the ibm,thread-groups-property=1, it
shares both L1 and the instruction data flow (SMT). We already have a
cpu_smt_map, hence, this was named cpu_smallcore_map a couple of years
ago when I wrote that patch.


> For L2 we have thread_group_l2_cache_map to store the tasks from the thread
> group.  but cpu_l2_cache_map for keeping track of tasks.

> 
> I think we should do some renaming to keep the names consistent.
> I would say probably say move the current cpu_l2_cache_map to
> cpu_llc_cache_map and move the new aka  thread_group_l2_cache_map as
> cpu_l2_cache_map to be somewhat consistent.

Hmm.. cpu_llc_cache_map is still very generic. We want to have
something that defines l2 map.

I agree that we need to keep it consistent. How about renaming
cpu_l1_cache_map to thread_groups_l1_cache_map ?

That way thread_groups_l1_cache_map and thread_groups_l2_cache_map
refer to the corresponding L1 and L2 siblings as discovered from
ibm,thread-groups property.

We have the cpu_smallcore_mask and the cpu_l2_cache_map unchanged as
it was before.


> 
> >  /* SMP operations for this machine */
> >  struct smp_ops_t *smp_ops;
> > 
> > @@ -840,7 +851,8 @@ static int init_cpu_cache_map(int cpu, unsigned int 
> > cache_property)
> >     if (!dn)
> >             return -ENODATA;
> > 
> > -   if (!(cache_property == THREAD_GROUP_SHARE_L1))
> > +   if (!(cache_property == THREAD_GROUP_SHARE_L1 ||
> > +         cache_property == THREAD_GROUP_SHARE_L2))
> >             return -EINVAL;
> > 
> >     if (!cpu_tgl->nr_properties) {
> > @@ -867,7 +879,10 @@ static int init_cpu_cache_map(int cpu, unsigned int 
> > cache_property)
> >             goto out;
> >     }
> > 
> > -   mask = &per_cpu(cpu_l1_cache_map, cpu);
> > +   if (cache_property == THREAD_GROUP_SHARE_L1)
> > +           mask = &per_cpu(cpu_l1_cache_map, cpu);
> > +   else if (cache_property == THREAD_GROUP_SHARE_L2)
> > +           mask = &per_cpu(thread_group_l2_cache_map, cpu);
> > 
> >     zalloc_cpumask_var_node(mask, GFP_KERNEL, cpu_to_node(cpu));
> > 
> > @@ -973,6 +988,16 @@ static int init_big_cores(void)
> >     }
> > 
> >     has_big_cores = true;
> > +
> > +   for_each_possible_cpu(cpu) {
> > +           int err = init_cpu_cache_map(cpu, THREAD_GROUP_SHARE_L2);
> > +
> > +           if (err)
> > +                   return err;
> > +   }
> > +
> > +   thread_group_shares_l2 = true;
> 
> Why do we need a separate loop. Why cant we merge this in the above loop
> itself?


No, there are platforms where one THREAD_GROUP_SHARE_L1 exists while
THREAD_GROUP_SHARE_L2 doesn't exist. It becomes easier if these are
separately tracked. Also, what do we gain if we put this in the same
loop? It will be (nr_possible_cpus * 2 * invocations of
init_cpu_cache_map()) as opposed to 2 * (nr_possible_cpus *
invocations of init_cpu_cache_map()). Isn't it ?




> 
> > +   pr_info("Thread-groups in a core share L2-cache\n");
> 
> Can this be moved to a pr_debug? Does it help any regular user/admins to
> know if thread-groups shared l2 cache. Infact it may confuse users on what
> thread groups are and which thread groups dont share cache.
> I would prefer some other name than thread_group_shares_l2 but dont know any
> better alternatives and may be my choices are even worse.

Would you be ok with "L2 cache shared by threads of the small core" ?


> 
> >     return 0;
> >  }
> > 
> > @@ -1287,6 +1312,31 @@ static bool update_mask_by_l2(int cpu, cpumask_var_t 
> > *mask)
> >     if (has_big_cores)
> >             submask_fn = cpu_smallcore_mask;
> > 
> > +
> 
> NIT: extra blank line?

Will remove this. 
> 
> > +   /*
> > +    * If the threads in a thread-group share L2 cache, then then
> > +    * the L2-mask can be obtained from thread_group_l2_cache_map.
> > +    */
> > +   if (thread_group_shares_l2) {
> > +           /* Siblings that share L1 is a subset of siblings that share 
> > L2.*/
> > +           or_cpumasks_related(cpu, cpu, submask_fn, cpu_l2_cache_mask);
> > +           if (*mask) {
> > +                   cpumask_andnot(*mask,
> > +                                  per_cpu(thread_group_l2_cache_map, cpu),
> > +                                  cpu_l2_cache_mask(cpu));
> > +           } else {
> > +                   mask = &per_cpu(thread_group_l2_cache_map, cpu);
> > +           }
> > +
> > +           for_each_cpu(i, *mask) {
> > +                   if (!cpu_online(i))
> > +                           continue;
> > +                   set_cpus_related(i, cpu, cpu_l2_cache_mask);
> > +           }
> > +
> > +           return true;
> > +   }
> > +
> 
> Ah this can be simplified to:
> if (thread_group_shares_l2) {
>       cpumask_set_cpu(cpu, cpu_l2_cache_mask(cpu));
> 
>       for_each_cpu(i, per_cpu(thread_group_l2_cache_map, cpu)) {
>               if (cpu_online(i))
>                       set_cpus_related(i, cpu, cpu_l2_cache_mask);
>       }

Don't we want to enforce that the siblings sharing L1 be a subset of
the siblings sharing L2 ? Or do you recommend putting in a check for
that somewhere ?


> }
> 
> No?
> 
> >     l2_cache = cpu_to_l2cache(cpu);
> >     if (!l2_cache || !*mask) {
> >             /* Assume only core siblings share cache with this CPU */
> 
> -- 
> Thanks and Regards
> Srikar Dronamraju

Re: [PATCH 2/3] powerpc/smp: Add support detecting thread-groups sharing L2 cache

Reply via email to