Re: [PATCH v5] x86,sched: allow topologies where NUMA nodes share an LLC

2018-04-16 Thread Borislav Petkov
On Fri, Apr 06, 2018 at 05:21:30PM -0700, Alison Schofield wrote:
> From: Alison Schofield 
> 
> Intel's Skylake Server CPUs have a different LLC topology than previous
> generations. When in Sub-NUMA-Clustering (SNC) mode, the package is
> divided into two "slices", each containing half the cores, half the LLC,
> and one memory controller and each slice is enumerated to Linux as a
> NUMA node. This is similar to how the cores and LLC were arranged
> for the Cluster-On-Die (CoD) feature.
> 
> CoD allowed the same cache line to be present in each half of the LLC.
> But, with SNC, each line is only ever present in *one* slice. This
> means that the portion of the LLC *available* to a CPU depends on the
> data being accessed:
> 
> Remote socket: entire package LLC is shared
> Local socket->local slice: data goes into local slice LLC
> Local socket->remote slice: data goes into remote-slice LLC. Slightly
>   higher latency than local slice LLC.
> 
> The biggest implication from this is that a process accessing all
> NUMA-local memory only sees half the LLC capacity.
> 
> The CPU describes its cache hierarchy with the CPUID instruction. One
> of the CPUID leaves enumerates the "logical processors sharing this
> cache". This information is used for scheduling decisions so that tasks
> move more freely between CPUs sharing the cache.
> 
> But, the CPUID for the SNC configuration discussed above enumerates
> the LLC as being shared by the entire package. This is not 100%
> precise because the entire cache is not usable by all accesses. But,
> it *is* the way the hardware enumerates itself, and this is not likely
> to change.
> 
> The userspace visible impact of all the above is that the sysfs info
> reports the entire LLC as being available to the entire package. As
> noted above, this is not true for local socket accesses. This patch
> does not correct the sysfs info. It is the same, pre and post patch.
> 
> This patch continues to allow this SNC topology and it does so without
> complaint. It eliminates a warning that looks like this:
> 
>   sched: CPU #3's llc-sibling CPU #0 is not on the same node! [node: 1 != 
> 0]. Ignoring dependency.
> 
> The warning is coming from the sane_topology check() in smpboot.c.

s/sane_topology check()/topology_sane() check/

> To fix this, add a vendor and model specific check to never call
> topology_sane() for these systems. Also, just like "Cluster-on-Die"
> we throw out the "coregroup" sched_domain_topology_level and use
> NUMA information from the SRAT alone.
> 
> This is OK at least on the hardware we are immediately concerned about
> because the LLC sharing happens at both the slice and at the package
> level, which are also NUMA boundaries.

I wish everyone would write commit messages like this. Very good and
nicely written explanation!

> Signed-off-by: Alison Schofield 
> Cc: Dave Hansen 
> Cc: Tony Luck 
> Cc: Tim Chen 
> Cc: "H. Peter Anvin" 
> Cc: Borislav Petkov 
> Cc: Peter Zijlstra (Intel) 
> Cc: David Rientjes 
> Cc: Igor Mammedov 
> Cc: Prarit Bhargava 
> Cc: brice.gog...@gmail.com
> Cc: Ingo Molnar 
> ---

Reviewed-by: Borislav Petkov 

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


Re: [PATCH v5] x86,sched: allow topologies where NUMA nodes share an LLC

2018-04-16 Thread Borislav Petkov
On Fri, Apr 06, 2018 at 05:21:30PM -0700, Alison Schofield wrote:
> From: Alison Schofield 
> 
> Intel's Skylake Server CPUs have a different LLC topology than previous
> generations. When in Sub-NUMA-Clustering (SNC) mode, the package is
> divided into two "slices", each containing half the cores, half the LLC,
> and one memory controller and each slice is enumerated to Linux as a
> NUMA node. This is similar to how the cores and LLC were arranged
> for the Cluster-On-Die (CoD) feature.
> 
> CoD allowed the same cache line to be present in each half of the LLC.
> But, with SNC, each line is only ever present in *one* slice. This
> means that the portion of the LLC *available* to a CPU depends on the
> data being accessed:
> 
> Remote socket: entire package LLC is shared
> Local socket->local slice: data goes into local slice LLC
> Local socket->remote slice: data goes into remote-slice LLC. Slightly
>   higher latency than local slice LLC.
> 
> The biggest implication from this is that a process accessing all
> NUMA-local memory only sees half the LLC capacity.
> 
> The CPU describes its cache hierarchy with the CPUID instruction. One
> of the CPUID leaves enumerates the "logical processors sharing this
> cache". This information is used for scheduling decisions so that tasks
> move more freely between CPUs sharing the cache.
> 
> But, the CPUID for the SNC configuration discussed above enumerates
> the LLC as being shared by the entire package. This is not 100%
> precise because the entire cache is not usable by all accesses. But,
> it *is* the way the hardware enumerates itself, and this is not likely
> to change.
> 
> The userspace visible impact of all the above is that the sysfs info
> reports the entire LLC as being available to the entire package. As
> noted above, this is not true for local socket accesses. This patch
> does not correct the sysfs info. It is the same, pre and post patch.
> 
> This patch continues to allow this SNC topology and it does so without
> complaint. It eliminates a warning that looks like this:
> 
>   sched: CPU #3's llc-sibling CPU #0 is not on the same node! [node: 1 != 
> 0]. Ignoring dependency.
> 
> The warning is coming from the sane_topology check() in smpboot.c.

s/sane_topology check()/topology_sane() check/

> To fix this, add a vendor and model specific check to never call
> topology_sane() for these systems. Also, just like "Cluster-on-Die"
> we throw out the "coregroup" sched_domain_topology_level and use
> NUMA information from the SRAT alone.
> 
> This is OK at least on the hardware we are immediately concerned about
> because the LLC sharing happens at both the slice and at the package
> level, which are also NUMA boundaries.

I wish everyone would write commit messages like this. Very good and
nicely written explanation!

> Signed-off-by: Alison Schofield 
> Cc: Dave Hansen 
> Cc: Tony Luck 
> Cc: Tim Chen 
> Cc: "H. Peter Anvin" 
> Cc: Borislav Petkov 
> Cc: Peter Zijlstra (Intel) 
> Cc: David Rientjes 
> Cc: Igor Mammedov 
> Cc: Prarit Bhargava 
> Cc: brice.gog...@gmail.com
> Cc: Ingo Molnar 
> ---

Reviewed-by: Borislav Petkov 

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


[PATCH v5] x86,sched: allow topologies where NUMA nodes share an LLC

2018-04-06 Thread Alison Schofield
From: Alison Schofield 

Intel's Skylake Server CPUs have a different LLC topology than previous
generations. When in Sub-NUMA-Clustering (SNC) mode, the package is
divided into two "slices", each containing half the cores, half the LLC,
and one memory controller and each slice is enumerated to Linux as a
NUMA node. This is similar to how the cores and LLC were arranged
for the Cluster-On-Die (CoD) feature.

CoD allowed the same cache line to be present in each half of the LLC.
But, with SNC, each line is only ever present in *one* slice. This
means that the portion of the LLC *available* to a CPU depends on the
data being accessed:

Remote socket: entire package LLC is shared
Local socket->local slice: data goes into local slice LLC
Local socket->remote slice: data goes into remote-slice LLC. Slightly
higher latency than local slice LLC.

The biggest implication from this is that a process accessing all
NUMA-local memory only sees half the LLC capacity.

The CPU describes its cache hierarchy with the CPUID instruction. One
of the CPUID leaves enumerates the "logical processors sharing this
cache". This information is used for scheduling decisions so that tasks
move more freely between CPUs sharing the cache.

But, the CPUID for the SNC configuration discussed above enumerates
the LLC as being shared by the entire package. This is not 100%
precise because the entire cache is not usable by all accesses. But,
it *is* the way the hardware enumerates itself, and this is not likely
to change.

The userspace visible impact of all the above is that the sysfs info
reports the entire LLC as being available to the entire package. As
noted above, this is not true for local socket accesses. This patch
does not correct the sysfs info. It is the same, pre and post patch.

This patch continues to allow this SNC topology and it does so without
complaint. It eliminates a warning that looks like this:

sched: CPU #3's llc-sibling CPU #0 is not on the same node! [node: 1 != 
0]. Ignoring dependency.

The warning is coming from the sane_topology check() in smpboot.c.
To fix this, add a vendor and model specific check to never call
topology_sane() for these systems. Also, just like "Cluster-on-Die"
we throw out the "coregroup" sched_domain_topology_level and use
NUMA information from the SRAT alone.

This is OK at least on the hardware we are immediately concerned about
because the LLC sharing happens at both the slice and at the package
level, which are also NUMA boundaries.

Signed-off-by: Alison Schofield 
Cc: Dave Hansen 
Cc: Tony Luck 
Cc: Tim Chen 
Cc: "H. Peter Anvin" 
Cc: Borislav Petkov 
Cc: Peter Zijlstra (Intel) 
Cc: David Rientjes 
Cc: Igor Mammedov 
Cc: Prarit Bhargava 
Cc: brice.gog...@gmail.com
Cc: Ingo Molnar 
---

Changes in v5:

 * Removed the redundant setting of boolean x86_has_numa_in_package.
   Related comments were put back in their pre-patch locations.

Changes in v4:

 * Added this to the patch description above:

   The userspace visible impact of all the above is that the sysfs info
   reports the entire LLC as being available to the entire package. As
   noted above, this is not true for local socket accesses. This patch
   does not correct the sysfs info. It is the same, pre and post patch.

   This patch continues to allow this SNC topology and it does so without
   complaint. It eliminates a warning that looks like this:
  
 * Changed the code comment per PeterZ/DaveH discussion wrt bypassing
   that topology_sane() check in match_llc()
/*
 * false means 'c' does not share the LLC of 'o'.
 * Note: this decision gets reflected all the way
 * out to userspace
 */
   This message hopes to clarify what happens when we return false.
   Note that returning false is *not* new to this patch. Without
   this patch we always returned false - with a warning. This avoids
   that warning and returns false directly.

 * Remove __initconst from snc_cpu[] declaration that I had added in
   v3. This is not an init time only path. 

 * Do not deal with the wrong sysfs info. It was wrong before this
   patch and it will be the exact same 'wrongness' after this patch.

   We can address the sysfs reporting separately. Here are some options:
   1) Change the way the LLC-size is reported.  Enumerate two separate,
  half-sized LLCs shared only by the slice when SNC mode is on.
   2) Do not export the sysfs info that is wrong. Prevents userspace
  from making bad decisions based on inaccurate info.


Changes in v3:

 * Use x86_match_cpu() for vendor & model check and moved related
   comments to the array define. (Still just one system model)

 * 

[PATCH v5] x86,sched: allow topologies where NUMA nodes share an LLC

2018-04-06 Thread Alison Schofield
From: Alison Schofield 

Intel's Skylake Server CPUs have a different LLC topology than previous
generations. When in Sub-NUMA-Clustering (SNC) mode, the package is
divided into two "slices", each containing half the cores, half the LLC,
and one memory controller and each slice is enumerated to Linux as a
NUMA node. This is similar to how the cores and LLC were arranged
for the Cluster-On-Die (CoD) feature.

CoD allowed the same cache line to be present in each half of the LLC.
But, with SNC, each line is only ever present in *one* slice. This
means that the portion of the LLC *available* to a CPU depends on the
data being accessed:

Remote socket: entire package LLC is shared
Local socket->local slice: data goes into local slice LLC
Local socket->remote slice: data goes into remote-slice LLC. Slightly
higher latency than local slice LLC.

The biggest implication from this is that a process accessing all
NUMA-local memory only sees half the LLC capacity.

The CPU describes its cache hierarchy with the CPUID instruction. One
of the CPUID leaves enumerates the "logical processors sharing this
cache". This information is used for scheduling decisions so that tasks
move more freely between CPUs sharing the cache.

But, the CPUID for the SNC configuration discussed above enumerates
the LLC as being shared by the entire package. This is not 100%
precise because the entire cache is not usable by all accesses. But,
it *is* the way the hardware enumerates itself, and this is not likely
to change.

The userspace visible impact of all the above is that the sysfs info
reports the entire LLC as being available to the entire package. As
noted above, this is not true for local socket accesses. This patch
does not correct the sysfs info. It is the same, pre and post patch.

This patch continues to allow this SNC topology and it does so without
complaint. It eliminates a warning that looks like this:

sched: CPU #3's llc-sibling CPU #0 is not on the same node! [node: 1 != 
0]. Ignoring dependency.

The warning is coming from the sane_topology check() in smpboot.c.
To fix this, add a vendor and model specific check to never call
topology_sane() for these systems. Also, just like "Cluster-on-Die"
we throw out the "coregroup" sched_domain_topology_level and use
NUMA information from the SRAT alone.

This is OK at least on the hardware we are immediately concerned about
because the LLC sharing happens at both the slice and at the package
level, which are also NUMA boundaries.

Signed-off-by: Alison Schofield 
Cc: Dave Hansen 
Cc: Tony Luck 
Cc: Tim Chen 
Cc: "H. Peter Anvin" 
Cc: Borislav Petkov 
Cc: Peter Zijlstra (Intel) 
Cc: David Rientjes 
Cc: Igor Mammedov 
Cc: Prarit Bhargava 
Cc: brice.gog...@gmail.com
Cc: Ingo Molnar 
---

Changes in v5:

 * Removed the redundant setting of boolean x86_has_numa_in_package.
   Related comments were put back in their pre-patch locations.

Changes in v4:

 * Added this to the patch description above:

   The userspace visible impact of all the above is that the sysfs info
   reports the entire LLC as being available to the entire package. As
   noted above, this is not true for local socket accesses. This patch
   does not correct the sysfs info. It is the same, pre and post patch.

   This patch continues to allow this SNC topology and it does so without
   complaint. It eliminates a warning that looks like this:
  
 * Changed the code comment per PeterZ/DaveH discussion wrt bypassing
   that topology_sane() check in match_llc()
/*
 * false means 'c' does not share the LLC of 'o'.
 * Note: this decision gets reflected all the way
 * out to userspace
 */
   This message hopes to clarify what happens when we return false.
   Note that returning false is *not* new to this patch. Without
   this patch we always returned false - with a warning. This avoids
   that warning and returns false directly.

 * Remove __initconst from snc_cpu[] declaration that I had added in
   v3. This is not an init time only path. 

 * Do not deal with the wrong sysfs info. It was wrong before this
   patch and it will be the exact same 'wrongness' after this patch.

   We can address the sysfs reporting separately. Here are some options:
   1) Change the way the LLC-size is reported.  Enumerate two separate,
  half-sized LLCs shared only by the slice when SNC mode is on.
   2) Do not export the sysfs info that is wrong. Prevents userspace
  from making bad decisions based on inaccurate info.


Changes in v3:

 * Use x86_match_cpu() for vendor & model check and moved related
   comments to the array define. (Still just one system model)

 * Updated the comments surrounding the topology_sane() check.


Changes in v2:

 * Add vendor check (Intel) where we only had a model check (Skylake_X).
   Considered the suggestion of adding a new flag here but thought that
   to be overkill for this usage.

 * Return false,