On Wed, Jun 05, 2019 at 04:59:22PM +0100, Matt Fleming wrote:
> SD_BALANCE_{FORK,EXEC} and SD_WAKE_AFFINE are stripped in sd_init()
> for any sched domains with a NUMA distance greater than 2 hops
> (RECLAIM_DISTANCE). The idea being that it's expensive to balance
> across domains that far apart.
>
> However, as is rather unfortunately explained in
>
>   commit 32e45ff43eaf ("mm: increase RECLAIM_DISTANCE to 30")
>
> the value for RECLAIM_DISTANCE is based on node distance tables from
> 2011-era hardware.
>
> Current AMD EPYC machines have the following NUMA node distances:
>
> node distances:
> node   0   1   2   3   4   5   6   7
>   0:  10  16  16  16  32  32  32  32
>   1:  16  10  16  16  32  32  32  32
>   2:  16  16  10  16  32  32  32  32
>   3:  16  16  16  10  32  32  32  32
>   4:  32  32  32  32  10  16  16  16
>   5:  32  32  32  32  16  10  16  16
>   6:  32  32  32  32  16  16  10  16
>   7:  32  32  32  32  16  16  16  10
>
> where 2 hops is 32.
>
> The result is that the scheduler fails to load balance properly across
> NUMA nodes on different sockets -- 2 hops apart.
>
> Update the code in sd_init() to account for modern node distances, and
> maintaining backward-compatible behaviour by respecting
> RECLAIM_DISTANCE for distances more than 2 hops.

And then we had two magic values :/

Should we not 'fix' RECLAIM_DISTANCE for EPYC or something? Because
surely, if we want to load-balance aggressively over 30, then so too
should we do node_reclaim() I'm thinking.
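
For reference, the check under discussion can be sketched roughly as
below. This is a standalone userspace approximation based only on the
description quoted above, not the actual sd_init() code: the helper
names numa_domain_flags() and epyc_example() are hypothetical, and the
SD_* bit values are illustrative placeholders rather than the kernel's
real definitions. RECLAIM_DISTANCE is taken as 30, per the commit title
cited above.

```c
#include <assert.h>
#include <stdio.h>

/* Placeholder bit values for illustration only (assumed, not the kernel's). */
#define SD_BALANCE_EXEC  0x1u
#define SD_BALANCE_FORK  0x2u
#define SD_WAKE_AFFINE   0x4u

/* Value taken from the title of commit 32e45ff43eaf. */
#define RECLAIM_DISTANCE 30

/*
 * Hypothetical helper: strip the fork/exec balancing and wake-affine
 * flags from a NUMA domain whose node distance exceeds RECLAIM_DISTANCE,
 * as the quoted description says sd_init() does.
 */
static unsigned int numa_domain_flags(unsigned int flags, int distance)
{
	if (distance > RECLAIM_DISTANCE)
		flags &= ~(SD_BALANCE_EXEC | SD_BALANCE_FORK | SD_WAKE_AFFINE);
	return flags;
}

/* Apply the check to the two remote distances from the EPYC table. */
void epyc_example(void)
{
	unsigned int all = SD_BALANCE_EXEC | SD_BALANCE_FORK | SD_WAKE_AFFINE;

	/* Same socket (16): at or below the threshold, flags are kept. */
	assert(numa_domain_flags(all, 16) == all);
	/* Other socket (32): above the threshold, all three are stripped. */
	assert(numa_domain_flags(all, 32) == 0u);

	printf("distance 16: flags %#x\n", numa_domain_flags(all, 16));
	printf("distance 32: flags %#x\n", numa_domain_flags(all, 32));
}
```

With the EPYC table, every cross-socket domain sits at 32, so a single
threshold of 30 turns off all three flags exactly where the two sockets
meet, which is the behaviour the patch description complains about.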