The NUMA balancing code spends way too much CPU time scanning and faulting when running multi-threaded workloads.

This patch set slows down NUMA PTE scanning when there are lots of shared faults, and when dealing with large NUMA groups that have a large fraction of shared faults.

Some results from Jirka's half-week performance run, on a 4 node system:
- improvements in the range of 10-30% for NAS benchmarks (mostly ft and lu subtests)
- SPECjbb2005 single instance mode - improvements in the range of 5-10%
- SPECjvm2008 - performance very similar to before, some small improvements for the scimark* subtests
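
To illustrate the idea of slowing the scan rate as the shared fault fraction grows, here is a minimal standalone sketch. It is not the code from these patches: the function name adjust_scan_period(), the bounds, and the linear scaling rule are all assumptions chosen for illustration only.

```c
/* Hypothetical bounds for the scan period, in milliseconds. */
#define SCAN_PERIOD_MIN_MS 1000u
#define SCAN_PERIOD_MAX_MS 60000u

/*
 * Hypothetical helper: return a new scan period given the current
 * period and the private/shared fault counts from the last window.
 * A larger shared fraction yields a longer period, i.e. slower
 * scanning. The result is clamped to the bounds above.
 */
static unsigned int adjust_scan_period(unsigned int period_ms,
                                       unsigned long private_faults,
                                       unsigned long shared_faults)
{
    unsigned long total = private_faults + shared_faults;
    unsigned long long scaled;

    /* No faults recorded: nothing to base a decision on. */
    if (total == 0)
        return period_ms;

    /* period * (1 + shared / total), using integer arithmetic. */
    scaled = period_ms +
             (unsigned long long)period_ms * shared_faults / total;

    if (scaled > SCAN_PERIOD_MAX_MS)
        scaled = SCAN_PERIOD_MAX_MS;
    if (scaled < SCAN_PERIOD_MIN_MS)
        scaled = SCAN_PERIOD_MIN_MS;

    return (unsigned int)scaled;
}
```

With these assumed rules, a task whose faults are all private keeps its current period, while a task whose faults are all shared has its period doubled, up to the maximum.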