Hi Chris

On Sat, 24 Oct 2020 at 01:49, Chris Mason <c...@fb.com> wrote:
>
> Hi everyone,
>
> We’re validating a new kernel in the fleet, and compared with v5.2,

Which version are you using? Several improvements have been added
since v5.5 and the rework of load_balance.

> performance is ~2-3% lower for some of our workloads. After some
> digging, Johannes found that our involuntary context switch rate was ~2x
> higher, and we were leaving a CPU idle a higher percentage of the time,
> even though the workload was trying to saturate the system.
>
> We were able to reproduce the problem with schbench, and Johannes
> bisected down to:
>
> commit 0b0695f2b34a4afa3f6e9aa1ff0e5336d8dad912
> Author: Vincent Guittot <vincent.guit...@linaro.org>
> Date:   Fri Oct 18 15:26:31 2019 +0200
>
>     sched/fair: Rework load_balance()
>
> Our working theory is the load balancing changes are leaving processes
> behind busy CPUs instead of moving them onto idle ones. I made a few
> schbench modifications to make this easier to demonstrate:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git/
>
> My VM has 40 cpus (20 cores, 2 threads per core), and my schbench
> command line is:

What is the topology? Are they all part of the same LLC?

>
> schbench -t 20 -r 0 -c 1000000 -s 1000 -i 30 -z 120
>
> This has two message threads, and 20 workers per message thread. Once
> woken up, the workers think for a full second, which means you’ll have
> some long latencies if you’re stuck behind one of these workers in the
> runqueue. The message thread does a little bit of work and then sleeps,
> so we end up with 40 threads hammering full blast on the CPU and 2
> threads popping in and out of idle.
>
> schbench times the delay from when a message thread wakes a worker to
> when the worker runs. On a good kernel, the output looks like this:
>
> Latency percentiles (usec) runtime 1290 (s) (3280 total samples)
>         50.0th: 155 (1653 samples)
>         75.0th: 189 (808 samples)
>         90.0th: 216 (501 samples)
>         95.0th: 227 (163 samples)
>         *99.0th: 256 (123 samples)
>         99.5th: 1510 (16 samples)
>         99.9th: 3132 (13 samples)
>         min=21, max=3286
>
> With 0b0695f2b34a, we get this:
>
> Latency percentiles (usec) runtime 1440 (s) (4480 total samples)
>         50.0th: 147 (2261 samples)
>         75.0th: 182 (1116 samples)
>         90.0th: 205 (671 samples)
>         95.0th: 224 (215 samples)
>         *99.0th: 12240 (173 samples) <-- much higher p99 and up
>         99.5th: 12752 (22 samples)
>         99.9th: 13104 (18 samples)
>         min=21, max=13172
>
> Since the idea is to fully load the machine with schbench, use schbench
> -t <your_num_cpus/2>, and make sure the box doesn’t have other stuff
> running in the background. I used a VM because it ended up giving more
> consistent results on our kernel test machines, which have some periodic
> noise running in the background.
>
> We’ve tried a few different approaches, but don’t quite have a solid
> fix yet. I thought I’d kick off the discussion with my most useful
> hunks so far:
>
> diff a/kernel/sched/fair.c b/kernel/sched/fair.c
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
>
> -chris
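
As an aside, for anyone reading along without the schbench source: the
number schbench reports is the gap between the waker taking a timestamp
and the woken worker actually getting back on a CPU. A standalone sketch
of that idea with plain pthreads (none of these helper names come from
schbench, this is only an illustration of what is being timed):

/*
 * Minimal sketch of the wakeup-to-run measurement: the waker records a
 * timestamp, wakes a worker, and the worker reports how long it took to
 * start running again. Not schbench code, just the idea.
 */
#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static struct timespec wake_time;
static int woken;

/* nanoseconds elapsed since the timestamp taken by the waker */
static long long nsec_since(const struct timespec *then)
{
        struct timespec now;

        clock_gettime(CLOCK_MONOTONIC, &now);
        return (now.tv_sec - then->tv_sec) * 1000000000LL +
               (now.tv_nsec - then->tv_nsec);
}

static void *worker(void *arg)
{
        pthread_mutex_lock(&lock);
        while (!woken)
                pthread_cond_wait(&cond, &lock);
        pthread_mutex_unlock(&lock);

        /* delay from the waker's timestamp until this thread runs again */
        printf("wakeup latency: %lld usec\n", nsec_since(&wake_time) / 1000);
        return NULL;
}

int main(void)
{
        pthread_t tid;

        pthread_create(&tid, NULL, worker, NULL);
        sleep(1);                       /* let the worker block */

        pthread_mutex_lock(&lock);
        clock_gettime(CLOCK_MONOTONIC, &wake_time);
        woken = 1;
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);

        pthread_join(tid, NULL);
        return 0;
}

Build with gcc -pthread. On an otherwise idle machine the printed latency
stays small; it grows when the woken worker has to queue behind other
runnable tasks on its CPU, which is the p99 blow-up in the numbers above.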