I'm writing to ask whether it makes sense to track idle CPUs in a shared cpumask in the sched domain. When a task wakes up, it can then select an idle CPU from this cpumask instead of scanning all the CPUs in the last-level cache (LLC) domain; when the system is heavily loaded, the scanning cost can be significantly reduced. The price is that atomic cpumask operations are added to the idle entry and exit paths.
I tested the following benchmarks on an x86 4-socket system with 24 cores per socket and 2 hyperthreads per core, 192 CPUs in total:

uperf throughput: netperf workload, tcp_nodelay, r/w size = 90

  threads	baseline-avg	%std	patch-avg	%std
  96		1		1.24	0.98		2.76
  144		1		1.13	1.35		4.01
  192		1		0.58	1.67		3.25
  240		1		2.49	1.68		3.55

hackbench: process mode, 100000 loops, 40 file descriptors per group

  group		baseline-avg	%std	patch-avg	%std
  2(80)	1		12.05	0.97		9.88
  3(120)	1		12.48	0.95		11.62
  4(160)	1		13.83	0.97		13.22
  5(200)	1		2.76	1.01		2.94

schbench: 99th percentile latency, 16 workers per message thread

  mthread	baseline-avg	%std	patch-avg	%std
  6(96)	1		1.24	0.993		1.73
  9(144)	1		0.38	0.998		0.39
  12(192)	1		1.58	0.995		1.64
  15(240)	1		51.71	0.606		37.41

sysbench mysql throughput: read/write, table size = 10,000,000

  thread	baseline-avg	%std	patch-avg	%std
  96		1		1.77	1.015		1.71
  144		1		3.39	0.998		4.05
  192		1		2.88	1.002		2.81
  240		1		2.07	1.011		2.09

kbuild: kexec reboot every time

  baseline-avg	patch-avg
  1		1

Any suggestions are highly appreciated!

Thanks,
-Aubrey

Aubrey Li (1):
  sched/fair: select idle cpu from idle cpumask in sched domain

 include/linux/sched/topology.h | 13 +++++++++++++
 kernel/sched/fair.c            |  4 +++-
 kernel/sched/topology.c        |  2 +-
 3 files changed, 17 insertions(+), 2 deletions(-)

--
2.25.1