On Fri, Mar 06, 2015 at 11:37:11AM -0700, David Ahern wrote:
> On 3/6/15 11:11 AM, Mike Galbraith wrote:
>
> In responding earlier today I realized that the topology is all wrong as you
> were pointing out. There should be 16 NUMA domains (4 memory controllers per
> socket and 4 sockets). There should be 8 sibling cores. I will look into why
> that is not getting set up properly and what we can do about fixing it.
So we changed the numa topology setup a while back; see commit cb83b629bae0
("sched/numa: Rewrite the CONFIG_NUMA sched domain support").

> But, I do not understand how the wrong topology is causing the NMI watchdog
> to trigger. In the end there are still N domains, M groups per domain and P
> cpus per group. Doesn't the balancing walk over all of them irrespective of
> physical topology?

Not quite; for regular load balancing only the first CPU in the domain will
iterate up. So if you have 4 'nodes' only 4 CPUs will iterate the entire
machine, not all 1024.

> Call Trace:
>  [000000000045dc30] double_rq_lock+0x4c/0x68
>  [000000000046a23c] load_balance+0x278/0x740
>  [00000000008aa178] __schedule+0x378/0x8e4
>  [00000000008aab1c] schedule+0x68/0x78
>  [00000000004718ac] do_exit+0x798/0x7c0
>  [000000000047195c] do_group_exit+0x88/0xc0
>  [0000000000481148] get_signal_to_deliver+0x3ec/0x4c8
>  [000000000042cbc0] do_signal+0x70/0x5e4
>  [000000000042d14c] do_notify_resume+0x18/0x50
>  [00000000004049c4] __handle_signal+0xc/0x2c
>
> For example the stream program has 1024 threads (1 for each CPU). If you
> ctrl-c the program or wait for it to terminate, that's when it trips. Other
> workloads that routinely trip it are make -j N, N some number (e.g., on a
> 256 cpu system 'make -j 128'), 10 seconds later oops stop that build, ctrl-c
> ... boom with the above stack trace.
>
> Code wise ... and this is still present in 3.18 and 3.20:
>
> schedule()
> - __schedule()
>   + irqs disabled: raw_spin_lock_irq(&rq->lock);
>
>   pick_next_task
>   - idle_balance()
>
> For 2.6.39 it's the invocation of idle_balance which is triggering load
> balancing with IRQs disabled. That's when the NMI watchdog trips.

So for idle_balance() look at SD_BALANCE_NEWIDLE; only domains with that set
will get iterated.

I suppose you could try something like the below on 3.18, which will disable
SD_BALANCE_NEWIDLE on all 'distant' nodes; but first check how your fixed
numa topology looks and whether you trigger that case at all.

---
 kernel/sched/core.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 17141da77c6e..7fce683928fe 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6268,6 +6268,7 @@ sd_init(struct sched_domain_topology_level *tl, int cpu)
 		if (sched_domains_numa_distance[tl->numa_level] > RECLAIM_DISTANCE) {
 			sd->flags &= ~(SD_BALANCE_EXEC |
 				       SD_BALANCE_FORK |
+				       SD_BALANCE_NEWIDLE |
 				       SD_WAKE_AFFINE);
 		}
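
In case it helps reasoning about the patch, below is a tiny standalone sketch
(plain userspace C, not kernel code) of the effect it is meant to have: a
newly idle CPU walks its sched-domain hierarchy bottom-up and only does
newidle balancing in levels that still have SD_BALANCE_NEWIDLE set, so
clearing the flag on levels whose NUMA distance exceeds RECLAIM_DISTANCE
keeps that walk off the distant, machine-wide levels. The struct, the flag
value and the four-level hierarchy are made up for illustration; only the
flag and macro names mirror the kernel's.

/*
 * Standalone sketch, NOT kernel code: a toy model of what the one-liner
 * above is meant to achieve.  A newly idle CPU walks its sched-domain
 * hierarchy bottom-up and only balances in levels that still have
 * SD_BALANCE_NEWIDLE set; clearing the flag on levels whose NUMA distance
 * exceeds RECLAIM_DISTANCE keeps the walk off the machine-wide levels.
 * The struct, the flag value and the hierarchy below are invented for
 * illustration; only the names mirror the kernel's.
 */
#include <stdio.h>

#define SD_BALANCE_NEWIDLE  0x01
#define RECLAIM_DISTANCE    30      /* kernel default cut-off */

struct fake_domain {
        const char   *name;
        int           numa_distance;    /* 10 == local */
        int           span_weight;      /* CPUs covered by this level */
        unsigned int  flags;
};

/* Made-up hierarchy loosely resembling a large NUMA box. */
static struct fake_domain domains[] = {
        { "SMT",      10,    8, SD_BALANCE_NEWIDLE },
        { "MC",       10,   64, SD_BALANCE_NEWIDLE },
        { "NUMA-20",  20,  256, SD_BALANCE_NEWIDLE },
        { "NUMA-40",  40, 1024, SD_BALANCE_NEWIDLE },   /* 'distant' level */
};

int main(void)
{
        int nr = sizeof(domains) / sizeof(domains[0]);
        int i;

        /* What the patch does at domain build time. */
        for (i = 0; i < nr; i++)
                if (domains[i].numa_distance > RECLAIM_DISTANCE)
                        domains[i].flags &= ~SD_BALANCE_NEWIDLE;

        /* The walk a newly idle CPU would then do, bottom-up. */
        for (i = 0; i < nr; i++) {
                if (!(domains[i].flags & SD_BALANCE_NEWIDLE)) {
                        printf("%-8s skipped (no SD_BALANCE_NEWIDLE)\n",
                               domains[i].name);
                        continue;
                }
                printf("%-8s newidle balance across %4d CPUs\n",
                       domains[i].name, domains[i].span_weight);
        }
        return 0;
}

To see which levels would actually be affected on the box in question, the
node distance table in /sys/devices/system/node/node*/distance (or
'numactl --hardware') shows whether any distance exceeds RECLAIM_DISTANCE
(30 unless the arch overrides it) once the topology is fixed up.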