Hi Srikar, On Wed, Dec 04, 2019 at 07:14:58PM +0530, Srikar Dronamraju wrote: > With commit 247f2f6f3c70 ("sched/core: Don't schedule threads on pre-empted > vCPUs"), scheduler avoids preempted vCPUs to schedule tasks on wakeup. > This leads to wrong choice of CPU, which in-turn leads to larger wakeup > latencies. Eventually, it leads to performance regression in latency > sensitive benchmarks like soltp, schbench etc.
The regression in the latency sensitive benchmarks is due to preferring potentially busy vCPUs over vCPUs in the CEDE state. > > On Powerpc, vcpu_is_preempted only looks at yield_count. If the > yield_count is odd, the vCPU is assumed to be preempted. However > yield_count is increased whenever LPAR enters CEDE state. So any CPU > that has entered CEDE state is assumed to be preempted. > > Even if vCPU of dedicated LPAR is preempted/donated, it should have > right of first-use since they are suppose to own the vCPU. > > On a Power9 System with 32 cores > # lscpu > Architecture: ppc64le > Byte Order: Little Endian > CPU(s): 128 > On-line CPU(s) list: 0-127 > Thread(s) per core: 8 > Core(s) per socket: 1 > Socket(s): 16 > NUMA node(s): 2 > Model: 2.2 (pvr 004e 0202) > Model name: POWER9 (architected), altivec supported > Hypervisor vendor: pHyp > Virtualization type: para > L1d cache: 32K > L1i cache: 32K > L2 cache: 512K > L3 cache: 10240K > NUMA node0 CPU(s): 0-63 > NUMA node1 CPU(s): 64-127 > > > # perf stat -a -r 5 ./schbench > v5.4 v5.4 + patch > Latency percentiles (usec) Latency percentiles (usec) > 49.0000th: 47 50.0000th: 33 > 74.0000th: 64 75.0000th: 44 > 89.0000th: 76 90.0000th: 50 > 94.0000th: 83 95.0000th: 53 > *98.0000th: 103 *99.0000th: 57 > 98.5000th: 2124 99.5000th: 59 > 98.9000th: 7976 99.9000th: 83 > min=-1, max=10519 min=0, max=117 > Latency percentiles (usec) Latency percentiles (usec) > 49.0000th: 45 50.0000th: 34 > 74.0000th: 61 75.0000th: 45 > 89.0000th: 70 90.0000th: 52 > 94.0000th: 77 95.0000th: 56 > *98.0000th: 504 *99.0000th: 62 > 98.5000th: 4012 99.5000th: 64 > 98.9000th: 8168 99.9000th: 79 > min=-1, max=14500 min=0, max=123 > Latency percentiles (usec) Latency percentiles (usec) > 49.0000th: 48 50.0000th: 35 > 74.0000th: 65 75.0000th: 47 > 89.0000th: 76 90.0000th: 55 > 94.0000th: 82 95.0000th: 59 > *98.0000th: 1098 *99.0000th: 67 > 98.5000th: 3988 99.5000th: 71 > 98.9000th: 9360 99.9000th: 98 > min=-1, max=19283 min=0, max=137 > Latency percentiles (usec) Latency percentiles (usec) > 49.0000th: 46 50.0000th: 35 > 74.0000th: 63 75.0000th: 46 > 89.0000th: 73 90.0000th: 53 > 94.0000th: 78 95.0000th: 57 > *98.0000th: 113 *99.0000th: 63 > 98.5000th: 2316 99.5000th: 65 > 98.9000th: 7704 99.9000th: 83 > min=-1, max=17976 min=0, max=139 > Latency percentiles (usec) Latency percentiles (usec) > 49.0000th: 46 50.0000th: 34 > 74.0000th: 62 75.0000th: 46 > 89.0000th: 73 90.0000th: 53 > 94.0000th: 79 95.0000th: 57 > *98.0000th: 97 *99.0000th: 64 > 98.5000th: 1398 99.5000th: 70 > 98.9000th: 8136 99.9000th: 100 > min=-1, max=10008 min=0, max=142 > > Performance counter stats for 'system wide' (4 runs): > > context-switches 42,604 ( +- 0.87% ) 45,397 ( +- 0.25% ) > cpu-migrations 0,195 ( +- 2.70% ) 230 ( +- 7.23% ) > page-faults 16,783 ( +- 14.87% ) 16,781 ( +- 9.77% ) > > Waiman Long suggested using static_keys. This needs to be Cc'ed to the stable kernel 4.18+ Reviewed-by: Gautham R. Shenoy <e...@linux.vnet.ibm.com> > > Reported-by: Parth Shah <pa...@linux.ibm.com> > Reported-by: Ihor Pasichnyk <ihor.pasich...@ibm.com> > Cc: Parth Shah <pa...@linux.ibm.com> > Cc: Ihor Pasichnyk <ihor.pasich...@ibm.com> > Cc: Juri Lelli <juri.le...@redhat.com> > Cc: Waiman Long <long...@redhat.com> > Signed-off-by: Srikar Dronamraju <sri...@linux.vnet.ibm.com> > --- > arch/powerpc/include/asm/spinlock.h | 5 +++-- > arch/powerpc/mm/numa.c | 4 ++++ > 2 files changed, 7 insertions(+), 2 deletions(-) > > diff --git a/arch/powerpc/include/asm/spinlock.h > b/arch/powerpc/include/asm/spinlock.h > index e9a960e28f3c..866f6ca0427a 100644 > --- a/arch/powerpc/include/asm/spinlock.h > +++ b/arch/powerpc/include/asm/spinlock.h > @@ -35,11 +35,12 @@ > #define LOCK_TOKEN 1 > #endif > > -#ifdef CONFIG_PPC_PSERIES > +#if defined(CONFIG_PPC_PSERIES) && defined(CONFIG_PPC_SPLPAR) > +DECLARE_STATIC_KEY_FALSE(shared_processor); > #define vcpu_is_preempted vcpu_is_preempted > static inline bool vcpu_is_preempted(int cpu) > { > - if (!firmware_has_feature(FW_FEATURE_SPLPAR)) > + if (!static_branch_unlikely(&shared_processor)) > return false; > return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1); > } > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c > index 50d68d21ddcc..ffb971f3a63c 100644 > --- a/arch/powerpc/mm/numa.c > +++ b/arch/powerpc/mm/numa.c > @@ -1568,9 +1568,13 @@ int prrn_is_enabled(void) > return prrn_enabled; > } > > +DEFINE_STATIC_KEY_FALSE(shared_processor); > +EXPORT_SYMBOL_GPL(shared_processor); > + > void __init shared_proc_topology_init(void) > { > if (lppaca_shared_proc(get_lppaca())) { > + static_branch_enable(&shared_processor); > bitmap_fill(cpumask_bits(&cpu_associativity_changes_mask), > nr_cpumask_bits); > numa_update_cpu_topology(false); > -- > 2.18.1 >