With commit 247f2f6f3c70 ("sched/core: Don't schedule threads on pre-empted vCPUs"), scheduler avoids preempted vCPUs to schedule tasks on wakeup. This leads to wrong choice of CPU, which in-turn leads to larger wakeup latencies. Eventually, it leads to performance regression in latency sensitive benchmarks like soltp, schbench etc.
On Powerpc, vcpu_is_preempted only looks at yield_count. If the yield_count is odd, the vCPU is assumed to be preempted. However yield_count is increased whenever LPAR enters CEDE state. So any CPU that has entered CEDE state is assumed to be preempted. Even if vCPU of dedicated LPAR is preempted/donated, it should have right of first-use since they are suppose to own the vCPU. On a Power9 System with 32 cores # lscpu Architecture: ppc64le Byte Order: Little Endian CPU(s): 128 On-line CPU(s) list: 0-127 Thread(s) per core: 8 Core(s) per socket: 1 Socket(s): 16 NUMA node(s): 2 Model: 2.2 (pvr 004e 0202) Model name: POWER9 (architected), altivec supported Hypervisor vendor: pHyp Virtualization type: para L1d cache: 32K L1i cache: 32K L2 cache: 512K L3 cache: 10240K NUMA node0 CPU(s): 0-63 NUMA node1 CPU(s): 64-127 # perf stat -a -r 5 ./schbench v5.4 v5.4 + patch Latency percentiles (usec) Latency percentiles (usec) 49.0000th: 47 50.0000th: 33 74.0000th: 64 75.0000th: 44 89.0000th: 76 90.0000th: 50 94.0000th: 83 95.0000th: 53 *98.0000th: 103 *99.0000th: 57 98.5000th: 2124 99.5000th: 59 98.9000th: 7976 99.9000th: 83 min=-1, max=10519 min=0, max=117 Latency percentiles (usec) Latency percentiles (usec) 49.0000th: 45 50.0000th: 34 74.0000th: 61 75.0000th: 45 89.0000th: 70 90.0000th: 52 94.0000th: 77 95.0000th: 56 *98.0000th: 504 *99.0000th: 62 98.5000th: 4012 99.5000th: 64 98.9000th: 8168 99.9000th: 79 min=-1, max=14500 min=0, max=123 Latency percentiles (usec) Latency percentiles (usec) 49.0000th: 48 50.0000th: 35 74.0000th: 65 75.0000th: 47 89.0000th: 76 90.0000th: 55 94.0000th: 82 95.0000th: 59 *98.0000th: 1098 *99.0000th: 67 98.5000th: 3988 99.5000th: 71 98.9000th: 9360 99.9000th: 98 min=-1, max=19283 min=0, max=137 Latency percentiles (usec) Latency percentiles (usec) 49.0000th: 46 50.0000th: 35 74.0000th: 63 75.0000th: 46 89.0000th: 73 90.0000th: 53 94.0000th: 78 95.0000th: 57 *98.0000th: 113 *99.0000th: 63 98.5000th: 2316 99.5000th: 65 98.9000th: 7704 99.9000th: 83 min=-1, max=17976 min=0, max=139 Latency percentiles (usec) Latency percentiles (usec) 49.0000th: 46 50.0000th: 34 74.0000th: 62 75.0000th: 46 89.0000th: 73 90.0000th: 53 94.0000th: 79 95.0000th: 57 *98.0000th: 97 *99.0000th: 64 98.5000th: 1398 99.5000th: 70 98.9000th: 8136 99.9000th: 100 min=-1, max=10008 min=0, max=142 Performance counter stats for 'system wide' (4 runs): context-switches 42,604 ( +- 0.87% ) 45,397 ( +- 0.25% ) cpu-migrations 0,195 ( +- 2.70% ) 230 ( +- 7.23% ) page-faults 16,783 ( +- 14.87% ) 16,781 ( +- 9.77% ) Waiman Long suggested using static_keys. Reported-by: Parth Shah <pa...@linux.ibm.com> Reported-by: Ihor Pasichnyk <ihor.pasich...@ibm.com> Cc: Parth Shah <pa...@linux.ibm.com> Cc: Ihor Pasichnyk <ihor.pasich...@ibm.com> Cc: Juri Lelli <juri.le...@redhat.com> Cc: Waiman Long <long...@redhat.com> Signed-off-by: Srikar Dronamraju <sri...@linux.vnet.ibm.com> --- arch/powerpc/include/asm/spinlock.h | 5 +++-- arch/powerpc/mm/numa.c | 4 ++++ 2 files changed, 7 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h index e9a960e28f3c..866f6ca0427a 100644 --- a/arch/powerpc/include/asm/spinlock.h +++ b/arch/powerpc/include/asm/spinlock.h @@ -35,11 +35,12 @@ #define LOCK_TOKEN 1 #endif -#ifdef CONFIG_PPC_PSERIES +#if defined(CONFIG_PPC_PSERIES) && defined(CONFIG_PPC_SPLPAR) +DECLARE_STATIC_KEY_FALSE(shared_processor); #define vcpu_is_preempted vcpu_is_preempted static inline bool vcpu_is_preempted(int cpu) { - if (!firmware_has_feature(FW_FEATURE_SPLPAR)) + if (!static_branch_unlikely(&shared_processor)) return false; return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1); } diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index 50d68d21ddcc..ffb971f3a63c 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -1568,9 +1568,13 @@ int prrn_is_enabled(void) return prrn_enabled; } +DEFINE_STATIC_KEY_FALSE(shared_processor); +EXPORT_SYMBOL_GPL(shared_processor); + void __init shared_proc_topology_init(void) { if (lppaca_shared_proc(get_lppaca())) { + static_branch_enable(&shared_processor); bitmap_fill(cpumask_bits(&cpu_associativity_changes_mask), nr_cpumask_bits); numa_update_cpu_topology(false); -- 2.18.1