[tip: x86/urgent] x86/asm/64: Align start of __clear_user() loop to 16-bytes

2020-06-19 Thread tip-bot2 for Matt Fleming
The following commit has been merged into the x86/urgent branch of tip:

Commit-ID:     bb5570ad3b54e7930997aec76ab68256d5236d94
Gitweb:        https://git.kernel.org/tip/bb5570ad3b54e7930997aec76ab68256d5236d94
Author:        Matt Fleming
AuthorDate:    Thu, 18 Jun 2020 11:20:02 +01:00
Committer:     Borislav Petkov
CommitterDate: Fri, 19 Jun 2020 18:32:11 +02:00

x86/asm/64: Align start of __clear_user() loop to 16-bytes

x86 CPUs can suffer severe performance drops if a tight loop, such as
the ones in __clear_user(), straddles a 16-byte instruction fetch
window, or worse, a 64-byte cacheline. This issue was discovered in the
SUSE kernel with the following commit,

  1153933703d9 ("x86/asm/64: Micro-optimize __clear_user() - Use immediate constants")

which increased the code object size from 10 bytes to 15 bytes and
caused the 8-byte copy loop in __clear_user() to be split across a
64-byte cacheline.

Aligning the start of the loop to 16-bytes makes this fit neatly inside
a single instruction fetch window again and restores the performance of
__clear_user() which is used heavily when reading from /dev/zero.
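
For illustration, here is a minimal user-space analogue of that zeroing
loop (a sketch of my own, not the kernel's __clear_user(); the function
name, operand constraints and local labels are assumptions). It shows how
a ".align 16" directive pins the loop head to a 16-byte boundary so the
three-instruction body fits inside one fetch window:

  /*
   * Sketch only: zero 'qwords' quadwords starting at dst, 8 bytes at a
   * time, with the loop head aligned to a 16-byte boundary.
   */
  #include <stddef.h>

  static void zero_qwords(void *dst, size_t qwords)
  {
          asm volatile(
                  "       testq  %[cnt],%[cnt]\n"
                  "       jz     2f\n"
                  "       .align 16\n"
                  "1:     movq   $0,(%[dst])\n"
                  "       addq   $8,%[dst]\n"
                  "       decq   %[cnt]\n"
                  "       jnz    1b\n"
                  "2:\n"
                  : [dst] "+r" (dst), [cnt] "+r" (qwords)
                  :
                  : "memory");
  }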

Here are some numbers from running libmicro's read_z* and pread_z*
microbenchmarks which read from /dev/zero:

  Zen 1 (Naples)

  libmicro-file
                                   5.7.0-rc6            5.7.0-rc6            5.7.0-rc6
                                              revert-1153933703d9+             align16+
  Time mean95-pread_z100k     9.9195 (   0.00%)    5.9856 (  39.66%)    5.9938 (  39.58%)
  Time mean95-pread_z10k      1.1378 (   0.00%)    0.7450 (  34.52%)    0.7467 (  34.38%)
  Time mean95-pread_z1k       0.2623 (   0.00%)    0.2251 (  14.18%)    0.2252 (  14.15%)
  Time mean95-pread_zw100k    9.9974 (   0.00%)    6.0648 (  39.34%)    6.0756 (  39.23%)
  Time mean95-read_z100k      9.8940 (   0.00%)    5.9885 (  39.47%)    5.9994 (  39.36%)
  Time mean95-read_z10k       1.1394 (   0.00%)    0.7483 (  34.33%)    0.7482 (  34.33%)

Note that this doesn't affect Haswell or Broadwell microarchitectures
which seem to avoid the alignment issue by executing the loop straight
out of the Loop Stream Detector (verified using perf events).

Fixes: 1153933703d9 ("x86/asm/64: Micro-optimize __clear_user() - Use immediate constants")
Signed-off-by: Matt Fleming 
Signed-off-by: Borislav Petkov 
Cc:  # v4.19+
Link: https://lkml.kernel.org/r/20200618102002.30034-1-m...@codeblueprint.co.uk
---
 arch/x86/lib/usercopy_64.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
index fff28c6..b0dfac3 100644
--- a/arch/x86/lib/usercopy_64.c
+++ b/arch/x86/lib/usercopy_64.c
@@ -24,6 +24,7 @@ unsigned long __clear_user(void __user *addr, unsigned long size)
asm volatile(
"   testq  %[size8],%[size8]\n"
"   jz 4f\n"
+   "   .align 16\n"
"0: movq $0,(%[dst])\n"
"   addq   $8,%[dst]\n"
"   decl %%ecx ; jnz   0b\n"


[tip: sched/core] sched/topology: Improve load balancing on AMD EPYC systems

2019-09-03 Thread tip-bot2 for Matt Fleming
The following commit has been merged into the sched/core branch of tip:

Commit-ID:     a55c7454a8c887b226a01d7eed088ccb5374d81e
Gitweb:        https://git.kernel.org/tip/a55c7454a8c887b226a01d7eed088ccb5374d81e
Author:        Matt Fleming
AuthorDate:    Thu, 08 Aug 2019 20:53:01 +01:00
Committer:     Ingo Molnar
CommitterDate: Tue, 03 Sep 2019 09:17:37 +02:00

sched/topology: Improve load balancing on AMD EPYC systems

SD_BALANCE_{FORK,EXEC} and SD_WAKE_AFFINE are stripped in sd_init()
for any sched domains with a NUMA distance greater than 2 hops
(RECLAIM_DISTANCE). The idea is that it's expensive to balance
across domains that far apart.

However, as is rather unfortunately explained in:

  commit 32e45ff43eaf ("mm: increase RECLAIM_DISTANCE to 30")

the value for RECLAIM_DISTANCE is based on node distance tables from
2011-era hardware.

Current AMD EPYC machines have the following NUMA node distances:

 node distances:
 node   0   1   2   3   4   5   6   7
   0:  10  16  16  16  32  32  32  32
   1:  16  10  16  16  32  32  32  32
   2:  16  16  10  16  32  32  32  32
   3:  16  16  16  10  32  32  32  32
   4:  32  32  32  32  10  16  16  16
   5:  32  32  32  32  16  10  16  16
   6:  32  32  32  32  16  16  10  16
   7:  32  32  32  32  16  16  16  10

where 2 hops is 32.
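
(For reference, a distance table like the one above can also be read back
programmatically. The sketch below is not part of this patch; it uses
libnuma's numa_distance() and is built with -lnuma.)

  /* Sketch: dump the NUMA distance matrix, matching the table above. */
  #include <numa.h>
  #include <stdio.h>

  int main(void)
  {
          int i, j, nr_nodes;

          if (numa_available() < 0)
                  return 1;

          nr_nodes = numa_max_node() + 1;
          for (i = 0; i < nr_nodes; i++) {
                  for (j = 0; j < nr_nodes; j++)
                          printf("%4d", numa_distance(i, j));
                  printf("\n");
          }
          return 0;
  }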

The result is that the scheduler fails to load balance properly across
NUMA nodes on different sockets -- 2 hops apart.

For example, pinning 16 busy threads to NUMA nodes 0 (CPUs 0-7) and 4
(CPUs 32-39) like so,

  $ numactl -C 0-7,32-39 ./spinner 16

causes all threads to fork and remain on node 0 until the active
balancer kicks in after a few seconds and forcibly moves some threads
to node 4.
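
(The 'spinner' program above is simply a CPU-burning test tool that is not
included with this patch; a minimal hypothetical equivalent, built with
-pthread, might look like the sketch below. Run it under the numactl
invocation above to reproduce the initial placement on node 0.)

  /* Hypothetical stand-in for 'spinner': spawn N threads that busy-loop
   * so the scheduler has load to balance across nodes. */
  #include <pthread.h>
  #include <stdlib.h>

  static void *spin(void *arg)
  {
          (void)arg;
          for (;;)
                  ;               /* burn CPU forever */
          return NULL;
  }

  int main(int argc, char **argv)
  {
          int i, nthreads = (argc > 1) ? atoi(argv[1]) : 1;

          for (i = 0; i < nthreads; i++) {
                  pthread_t t;
                  pthread_create(&t, NULL, spin, NULL);
          }
          pthread_exit(NULL);     /* keep the spinning threads alive */
  }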

Override node_reclaim_distance for AMD Zen.

Signed-off-by: Matt Fleming 
Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Mel Gorman 
Cc: Borislav Petkov 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: suravee.suthikulpa...@amd.com
Cc: Thomas Gleixner 
Cc: thomas.lenda...@amd.com
Cc: Tony Luck 
Link: https://lkml.kernel.org/r/20190808195301.13222-3-m...@codeblueprint.co.uk
Signed-off-by: Ingo Molnar 
---
 arch/x86/kernel/cpu/amd.c |  5 +++++
 include/linux/topology.h  | 14 ++++++++++++++
 kernel/sched/topology.c   |  3 ++-
 mm/khugepaged.c   |  2 +-
 mm/page_alloc.c   |  2 +-
 5 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 8d4e504..ceeb8af 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include 
+#include <linux/topology.h>
 #include 
 #include 
 #include 
@@ -824,6 +825,10 @@ static void init_amd_zn(struct cpuinfo_x86 *c)
 {
set_cpu_cap(c, X86_FEATURE_ZEN);
 
+#ifdef CONFIG_NUMA
+   node_reclaim_distance = 32;
+#endif
+
/*
 * Fix erratum 1076: CPB feature bit not being set in CPUID.
 * Always set it, except when running under a hypervisor.
diff --git a/include/linux/topology.h b/include/linux/topology.h
index 47a3e3c..579522e 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -59,6 +59,20 @@ int arch_update_cpu_topology(void);
  */
 #define RECLAIM_DISTANCE 30
 #endif
+
+/*
+ * The following tunable allows platforms to override the default node
+ * reclaim distance (RECLAIM_DISTANCE) if remote memory accesses are
+ * sufficiently fast that the default value actually hurts
+ * performance.
+ *
+ * AMD EPYC machines use this because even though the 2-hop distance
+ * is 32 (3.2x slower than a local memory access) performance actually
+ * *improves* if allowed to reclaim memory and load balance tasks
+ * between NUMA nodes 2-hops apart.
+ */
+extern int __read_mostly node_reclaim_distance;
+
 #ifndef PENALTY_FOR_NODE_WITH_CPUS
 #define PENALTY_FOR_NODE_WITH_CPUS (1)
 #endif
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8f83e8e..b5667a2 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1284,6 +1284,7 @@ static int sched_domains_curr_level;
 int sched_max_numa_distance;
 static int *sched_domains_numa_distance;
 static struct cpumask  ***sched_domains_numa_masks;
+int __read_mostly  node_reclaim_distance = RECLAIM_DISTANCE;
 #endif
 
 /*
@@ -1402,7 +1403,7 @@ sd_init(struct sched_domain_topology_level *tl,
 
sd->flags &= ~SD_PREFER_SIBLING;
sd->flags |= SD_SERIALIZE;
-   if (sched_domains_numa_distance[tl->numa_level] > RECLAIM_DISTANCE) {
+   if (sched_domains_numa_distance[tl->numa_level] > node_reclaim_distance) {
sd->flags &= ~(SD_BALANCE_EXEC |
   SD_BALANCE_FORK |
   SD_WAKE_AFFINE);
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index eaaa21b..ccede24 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -710,7 +710,7 @@ static bool 

[tip: sched/core] arch, ia64: Make NUMA select SMP

2019-09-03 Thread tip-bot2 for Matt Fleming
The following commit has been merged into the sched/core branch of tip:

Commit-ID:     a2cbfd46559e809c8165773b7fe8afa058b35414
Gitweb:        https://git.kernel.org/tip/a2cbfd46559e809c8165773b7fe8afa058b35414
Author:        Matt Fleming
AuthorDate:    Thu, 08 Aug 2019 20:53:00 +01:00
Committer:     Ingo Molnar
CommitterDate: Tue, 03 Sep 2019 09:17:36 +02:00

arch, ia64: Make NUMA select SMP

While it does make sense to allow CONFIG_NUMA and !CONFIG_SMP in
theory, it doesn't make much sense in practice.

Follow other architectures and make CONFIG_NUMA select CONFIG_SMP.

The motivation for this patch is to allow a new NUMA variable to be
initialised in kernel/sched/topology.c.

Signed-off-by: Matt Fleming 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Borislav Petkov 
Cc: Linus Torvalds 
Cc: Mel Gorman 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: suravee.suthikulpa...@amd.com
Cc: Thomas Gleixner 
Cc: thomas.lenda...@amd.com
Cc: Tony Luck 
Link: https://lkml.kernel.org/r/20190808195301.13222-2-m...@codeblueprint.co.uk
Signed-off-by: Ingo Molnar 
---
 arch/ia64/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/ia64/Kconfig b/arch/ia64/Kconfig
index 7468d8e..997baba 100644
--- a/arch/ia64/Kconfig
+++ b/arch/ia64/Kconfig
@@ -389,6 +389,7 @@ config NUMA
depends on !IA64_HP_SIM && !FLATMEM
default y if IA64_SGI_SN2
select ACPI_NUMA if ACPI
+   select SMP
help
  Say Y to compile the kernel to support NUMA (Non-Uniform Memory
  Access).  This option is for configuring high-end multiprocessor