Hi Julia,
On 03/09/20 15:09, Julia Lawall wrote: > Uses of SD_LOAD_BALANCE were removed in commit e669ac8ab952 (first > released in v5.8), with the comment: > > The SD_LOAD_BALANCE flag is set unconditionally for all domains in > sd_init(). > > I have the impression that this was not quite true. The NUMA domain was > not initialized with sd_init, and didn't have the SD_LOAD_BALANCE flag > set. Did you check the contents of /proc/sys/kernel/sched_domain/cpu*/domain*/flags (requires CONFIG_SCHED_DEBUG)? If the LSB is set, it would mean SD_LOAD_BALANCE is set. The sched_domain construction flow isn't the easiest thing to follow, but NUMA domains *have* to go through sd_init(). What happens is we first go through sched_init_numa(), and there we add some more topology levels on top of the default ones (or the arch-defined ones if using an arch-defined topology hierarchy) by using the NUMA distance table. We then build the actual domains in sched_init_domains(), and that goes through a loop that looks like for_each_cpu() { for_each_sd_topology() { build_sched_domain() -> sd_init() } } where the SD topology loop is going to iterate over the newly-added NUMA-specific topology levels. Since that used to unconditionally set SD_LOAD_BALANCE, NUMA domains really ought to have it. If that wasn't the case, we would have fired the (now removed) warning in sched_domain_debug_one() that would do: if (!(sd->flags & SD_LOAD_BALANCE)) { printk("does not load-balance\n"); if (sd->parent) printk(KERN_ERR "ERROR: !SD_LOAD_BALANCE domain has parent"); return -1; } > The effect is that in v5.8, the for_each_domain loop in > select_task_rq_fair can always end up at the global NUMA domain, and thus > consider any pair of waking cpu (cpu) and previous cpus of the wakee > (prev_cpu) as arguments to wake_affine. Up to v5.7, this was only > possible if cpu and prev_cpu were together in some lower level domain, ie > sharing the LLC. The effect is that in v5.8 wake_affine can often end up > choosing as a target a core that does not share the LLC with the core > where the thread ran previously. Threads then move around a lot between > the different sockets. > > Was this intentional? > AFAICT it isn't forbidden for the logic here to peek outside of the previous LLC. The NUMA reclaim distance thing says we allow affine wakeups and fork / exec balancing to move a task to a CPU at most RECLAIM_DISTANCE away (in NUMA distance values). However, I don't remember any patch changing this between v5.7 and v5.8. Briefly glancing over the kernel/sched log between v5.7 and v5.8, I don't see any obvious culprits. Did you try to bisect this? If it indeed ends on the SD_LOAD_BALANCE thing, well, I'll be off eating my keyboard. > The effect can be seen in the traces of the parsec vips benchmark at the > following URL: > > https://pages.lip6.fr/Julia.Lawall/vips.pdf > > The first two graphs (complete run and small fragment) are Linux v5.7 and > the next two are Linux v5.8. The machine has 160 hardware threads > organized in 4 sockets and the colors are by socket. In the small > fragment for v5.7 (second graph), one can see that a given pid pretty much > stays on the same socket, while in the corresponding fragment for v5.8 > (fourth graph), the pids move around between the sockets. The x's > describe the unblocks that result in a migration. A pink x means that the > migration is in the same socket, while a blue x means that the migration > is to another socket. It's not apparent from the graphs, but by adding > some tracing, it seems that the new socket is always the one of the core > that handles the wakeup. > Interesting graphs, thanks for sharing! > I haven't yet studied the early part of the execution of vips in detail, > but I suspect that the same issue causes all of the threads to be > initially on the same socket in v5.7, while in v5.8 they are more quickly > dispersed to other sockets. > > My impression from the parsec and the NAS benchmarks is that the v5.8 > performance is a bit better than v5.7, probably because of getting more > threads to more different sockets earlier, but other benchmarks might > rely more on locality and might react less well to threads moving around > so much in this way. > > julia