Hello,

we observe that the group imbalance bug can cause performance degradation of up to a factor of 10 on a 4 NUMA node server.
I have opened Bug 194231 for this issue:
https://bugzilla.kernel.org/show_bug.cgi?id=194231

The problem was first described in section 3.1 of this paper:
http://www.ece.ubc.ca/~sasha/papers/eurosys16-final29.pdf

The scheduler does not balance load correctly on a 4 NUMA node server in the following scenario:

* there are three independent ssh connections
* the first two ssh connections each run a single-threaded CPU-intensive workload
* the last ssh session runs a multi-threaded application which requires almost all cores in the system

We have used

* stress --cpu 1 as the single-threaded CPU-intensive workload
  http://people.seas.harvard.edu/~apw/stress/
* the lu.C.x benchmark from the NAS Parallel Benchmarks suite as the multi-threaded application
  https://www.nas.nasa.gov/publications/npb.html

Version-Release number of selected component (if applicable):
Reproduced on kernel 4.10.0-0.rc6

How reproducible:
It requires a server with at least 2 NUMA nodes. The problem gets worse on a 4 NUMA node server.

Steps to Reproduce:
1. Start 3 ssh connections to the server.
2. In the first two ssh connections, run stress --cpu 1
3. In the third ssh connection, run the lu.C.x benchmark with the number of threads equal to the number of CPUs in the system minus 4.
4. Run either Intel's numatop
       echo "N" | numatop -d log >/dev/null 2>&1 &
   or
       mpstat -P ALL 5
   and check the load distribution across the NUMA nodes. The mpstat output can be processed by the mpstat2node.py utility to aggregate data across NUMA nodes:
       https://github.com/jhladka/MPSTAT2NODE/blob/master/mpstat2node.py
       mpstat -P ALL 5 | mpstat2node.py --lscpu <(lscpu)
5. Compare the results against the same workload started from ONE ssh session (all processes are in one group).

Actual results:
Uneven load across NUMA nodes:

Average:    NODE    %usr    %idle
Average:     all   66.12    33.51
Average:       0   37.97    61.74
Average:       1   31.67    68.15
Average:       2   97.50     1.98
Average:       3   97.33     2.19

Please note that while the number of CPU-intensive threads is 62 on this 64 CPU system, NUMA nodes #0 and #1 are underutilized.
The real runtime of the lu.C.x benchmark went up from 114 seconds to 846 seconds!

Expected results:
Load evenly balanced across all NUMA nodes. The real runtime of the lu.C.x benchmark is the same regardless of whether the jobs were started from one ssh session or from multiple ssh sessions.

Additional info:
See
https://github.com/jplozi/wastedcores/blob/master/patches/group_imbalance_linux_4.1.patch
for a proposed patch for kernel 4.1.

I will upload a reproducer to the bug report
https://bugzilla.kernel.org/show_bug.cgi?id=194231

Thanks a lot!
Jirka
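
P.S. For readers who do not have mpstat2node.py at hand, the per-node aggregation from step 4 can be approximated with the minimal sketch below. It is not the mpstat2node.py implementation. The function names are hypothetical, the CPU-to-node map is assumed to come from `lscpu -p=CPU,NODE` (a different input than the `--lscpu <(lscpu)` option above), and the sample numbers are made up for illustration only, not measured.

```python
from collections import defaultdict


def parse_lscpu(text):
    """Map CPU id -> NUMA node from `lscpu -p=CPU,NODE` style output (assumed format)."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the commented header
        cpu, node = line.split(",")[:2]
        mapping[int(cpu)] = int(node)
    return mapping


def parse_mpstat_average(text):
    """Return {cpu: (usr, idle)} from the 'Average:' rows of `mpstat -P ALL`."""
    columns = None
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0] != "Average:":
            continue
        if "CPU" in fields:
            columns = fields  # header row names the columns
            continue
        if columns is None or not fields[1].isdigit():
            continue  # skip the "all" summary row
        row = dict(zip(columns, fields))
        stats[int(row["CPU"])] = (float(row["%usr"]), float(row["%idle"]))
    return stats


def aggregate_by_node(cpu_stats, cpu_to_node):
    """Average %usr and %idle over the CPUs that belong to each NUMA node."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for cpu, (usr, idle) in cpu_stats.items():
        acc = sums[cpu_to_node[cpu]]
        acc[0] += usr
        acc[1] += idle
        acc[2] += 1
    return {node: (u / n, i / n) for node, (u, i, n) in sorted(sums.items())}


# Made-up sample input: 4 CPUs on 2 nodes, reduced to the two columns of interest.
LSCPU = """# CPU,Node
0,0
1,0
2,1
3,1
"""

MPSTAT = """Average:     CPU    %usr   %idle
Average:     all   65.00   35.00
Average:       0   30.00   70.00
Average:       1   40.00   60.00
Average:       2   90.00   10.00
Average:       3  100.00    0.00
"""

result = aggregate_by_node(parse_mpstat_average(MPSTAT), parse_lscpu(LSCPU))
for node, (usr, idle) in result.items():
    print(f"node {node}: %usr={usr:.2f} %idle={idle:.2f}")
```

The columns are located by name from the mpstat header row rather than by position, since the set of columns printed by mpstat differs between sysstat versions.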