Re: NMI watchdog triggering during load_balance
On 3/6/15 12:29 PM, Mike Galbraith wrote:
> On Fri, 2015-03-06 at 11:37 -0700, David Ahern wrote:
>> But, I do not understand how the wrong topology is causing the NMI
>> watchdog to trigger. In the end there are still N domains, M groups per
>> domain and P cpus per group. Doesn't the balancing walk over all of them
>> irrespective of physical topology?
>
> You have this size extra large CPU domain that you shouldn't have,
> massive collisions therein ensue.

I was able to get the socket/cores/threads issue resolved, so the
topology is correct. But still need to check out a few things.

Thanks Mike and Peter for the suggestions.

David
Re: NMI watchdog triggering during load_balance
On Fri, Mar 06, 2015 at 11:37:11AM -0700, David Ahern wrote:
> On 3/6/15 11:11 AM, Mike Galbraith wrote:
> In responding earlier today I realized that the topology is all wrong as you
> were pointing out. There should be 16 NUMA domains (4 memory controllers per
> socket and 4 sockets). There should be 8 sibling cores. I will look into why
> that is not getting setup properly and what we can do about fixing it.

So we changed the numa topology setup a while back; see commit
cb83b629bae0 ("sched/numa: Rewrite the CONFIG_NUMA sched domain support").

> But, I do not understand how the wrong topology is causing the NMI watchdog
> to trigger. In the end there are still N domains, M groups per domain and P
> cpus per group. Doesn't the balancing walk over all of them irrespective of
> physical topology?

Not quite; so for regular load balancing only the first CPU in the
domain will iterate up. So if you have 4 'nodes' only 4 CPUs will
iterate the entire machine, not all 1024.

> Call Trace:
>  [0045dc30] double_rq_lock+0x4c/0x68
>  [0046a23c] load_balance+0x278/0x740
>  [008aa178] __schedule+0x378/0x8e4
>  [008aab1c] schedule+0x68/0x78
>  [004718ac] do_exit+0x798/0x7c0
>  [0047195c] do_group_exit+0x88/0xc0
>  [00481148] get_signal_to_deliver+0x3ec/0x4c8
>  [0042cbc0] do_signal+0x70/0x5e4
>  [0042d14c] do_notify_resume+0x18/0x50
>  [004049c4] __handle_signal+0xc/0x2c
>
> For example the stream program has 1024 threads (1 for each CPU). If you
> ctrl-c the program or wait for it to terminate, that's when it trips. Other
> workloads that routinely trip it are make -j N, N some number (e.g., on a
> 256 cpu system 'make -j 128'), 10 seconds later, oops, stop that build,
> ctrl-c ... boom with the above stack trace.
>
> Code wise ... and this is still present in 3.18 and 3.20:
>
> schedule()
>  - __schedule()
>    + irqs disabled: raw_spin_lock_irq(&rq->lock);
>
>      pick_next_task
>        - idle_balance()
>
> For 2.6.39 it's the invocation of idle_balance which is triggering load
> balancing with IRQs disabled. That's when the NMI watchdog trips.

So for idle_balance() look at SD_BALANCE_NEWIDLE, only domains with that
set will get iterated.

I suppose you could try something like the below on 3.18, which will
disable SD_BALANCE_NEWIDLE on all 'distant' nodes; but first check how
your fixed numa topology looks and if you trigger that case at all.

---
 kernel/sched/core.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 17141da77c6e..7fce683928fe 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6268,6 +6268,7 @@ sd_init(struct sched_domain_topology_level *tl, int cpu)
 		if (sched_domains_numa_distance[tl->numa_level] > RECLAIM_DISTANCE) {
 			sd->flags &= ~(SD_BALANCE_EXEC |
 				       SD_BALANCE_FORK |
+				       SD_BALANCE_NEWIDLE |
 				       SD_WAKE_AFFINE);
 		}
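[Context for the patch above: idle_balance() only walks the domain levels of
the idling CPU that have SD_BALANCE_NEWIDLE set, so clearing the flag on the
distant NUMA levels keeps the newidle path from scanning the whole machine.
A rough sketch of that loop, simplified from the 3.18-era kernel/sched/fair.c
and not the exact source:

static int idle_balance(struct rq *this_rq)
{
	int this_cpu = this_rq->cpu;
	struct sched_domain *sd;
	int continue_balancing = 1;
	int pulled_task = 0;

	/* Walk up this CPU's domain hierarchy: SIBLING -> MC -> CPU -> NUMA */
	rcu_read_lock();
	for_each_domain(this_cpu, sd) {
		if (!(sd->flags & SD_LOAD_BALANCE))
			continue;

		/*
		 * Only levels with SD_BALANCE_NEWIDLE are scanned when a CPU
		 * goes idle; with the patch above the remote NUMA levels lose
		 * the flag, so an exiting task no longer triggers a
		 * machine-wide scan from this path.
		 */
		if (sd->flags & SD_BALANCE_NEWIDLE)
			pulled_task = load_balance(this_cpu, this_rq, sd,
						   CPU_NEWLY_IDLE,
						   &continue_balancing);

		if (pulled_task)
			break;
	}
	rcu_read_unlock();

	return pulled_task;
}
]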
Re: NMI watchdog triggering during load_balance
On Fri, 2015-03-06 at 11:37 -0700, David Ahern wrote:
> But, I do not understand how the wrong topology is causing the NMI
> watchdog to trigger. In the end there are still N domains, M groups per
> domain and P cpus per group. Doesn't the balancing walk over all of them
> irrespective of physical topology?

You have this size extra large CPU domain that you shouldn't have,
massive collisions therein ensue.

	-Mike
Re: NMI watchdog triggering during load_balance
On 3/6/15 11:11 AM, Mike Galbraith wrote:
> That was the question, _do_ you have any control, because that topology
> is toxic. I guess your reply means 'nope'.
>
>> The system has 4 physical cpus (sockets). Each cpu has 32 cores with 8
>> threads per core and each cpu has 4 memory controllers.
>
> Thank god I've never met one of these, looks like the box from hell :)
>
>> If I disable SCHED_MC and CGROUPS_SCHED (group scheduling) there is a
>> noticeable improvement -- watchdog does not trigger and I do not get the
>> rq locks held for 2-3 seconds. But there is still fairly high cpu usage
>> for an idle system. Perhaps I should leave SCHED_MC on and disable
>> SCHED_SMT; I'll try that today.
>
> Well, if you disable SMT, your troubles _should_ shrink radically, as
> your box does. You should probably look at why you have CPU domains.
> You don't ever want to see that on a NUMA box.

In responding earlier today I realized that the topology is all wrong as
you were pointing out. There should be 16 NUMA domains (4 memory
controllers per socket and 4 sockets). There should be 8 sibling cores.
I will look into why that is not getting setup properly and what we can
do about fixing it.

But, I do not understand how the wrong topology is causing the NMI
watchdog to trigger. In the end there are still N domains, M groups per
domain and P cpus per group. Doesn't the balancing walk over all of them
irrespective of physical topology?

Here's another data point that jelled this morning explaining the
problem to someone: the NMI watchdog trips on a mass exit:

TPC: <_raw_spin_trylock_bh+0x38/0x100>
g0: 7fff  g1: 00ff  g2: 00070f8c  g3: fffe403b97891c98
g4: fffe803b963eda00  g5: 00010036c000  g6: fffe803b84108000  g7: 0093
o0: 0fe0  o1: 0fe0  o2: ff00  o3: 00200200
o4: 00a98080  o5:  sp: fffe803b8410ada1  ret_pc: 006800dc
RPC:
l0: 00e9b114  l1: 0001  l2: 0001  l3: 0005
l4: 2000  l5: fffe803b8410b990  l6: 0004  l7: 00f267b0
i0: 000100b10700  i1:  i2: 000101324d80  i3: fffe803b8410b6c0
i4: 0038  i5: 0498  i6: fffe803b8410ae51  i7: 0045dc30
I7:
Call Trace:
 [0045dc30] double_rq_lock+0x4c/0x68
 [0046a23c] load_balance+0x278/0x740
 [008aa178] __schedule+0x378/0x8e4
 [008aab1c] schedule+0x68/0x78
 [004718ac] do_exit+0x798/0x7c0
 [0047195c] do_group_exit+0x88/0xc0
 [00481148] get_signal_to_deliver+0x3ec/0x4c8
 [0042cbc0] do_signal+0x70/0x5e4
 [0042d14c] do_notify_resume+0x18/0x50
 [004049c4] __handle_signal+0xc/0x2c

For example the stream program has 1024 threads (1 for each CPU). If you
ctrl-c the program or wait for it to terminate, that's when it trips.
Other workloads that routinely trip it are make -j N, N some number
(e.g., on a 256 cpu system 'make -j 128'), 10 seconds later, oops, stop
that build, ctrl-c ... boom with the above stack trace.

Code wise ... and this is still present in 3.18 and 3.20:

schedule()
 - __schedule()
   + irqs disabled: raw_spin_lock_irq(&rq->lock);

     pick_next_task
       - idle_balance()

   + irqs enabled:
       different task: context_switch(rq, prev, next) --> finish_lock_switch
       or eventually, same task: raw_spin_unlock_irq(&rq->lock)

For 2.6.39 it's the invocation of idle_balance which is triggering load
balancing with IRQs disabled. That's when the NMI watchdog trips.

I'll pound on 3.18 and see if I can reproduce something similar there.

David
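[To make the path above concrete, a condensed sketch of the flow David
describes. It approximates the 2.6.39 layout where schedule() calls
idle_balance() directly; in 3.x the call sits under pick_next_task_fair(),
but the IRQ/locking pattern is the same. This is not the actual source:

asmlinkage void __sched schedule(void)
{
	struct rq *rq = this_rq();
	struct task_struct *prev, *next;
	int cpu = smp_processor_id();

	prev = rq->curr;
	raw_spin_lock_irq(&rq->lock);	/* IRQs off from here on */

	/* ... dequeue the blocking/exiting task ... */

	if (unlikely(!rq->nr_running))
		idle_balance(cpu, rq);	/* may call load_balance() ->
					 * double_rq_lock(), spinning on
					 * remote rq locks with IRQs still
					 * disabled -- this is where the
					 * NMI watchdog fires on a mass exit */

	next = pick_next_task(rq);

	if (prev != next)
		context_switch(rq, prev, next);	/* finish_lock_switch()
						 * re-enables IRQs */
	else
		raw_spin_unlock_irq(&rq->lock);	/* same task: IRQs back on here */
}
]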
Re: NMI watchdog triggering during load_balance
On Fri, 2015-03-06 at 08:01 -0700, David Ahern wrote:
> On 3/5/15 9:52 PM, Mike Galbraith wrote:
> >> CPU970 attaching sched-domain:
> >>  domain 0: span 968-975 level SIBLING
> >>   groups: 8 single CPU groups
> >>   domain 1: span 968-975 level MC
> >>    groups: 1 group with 8 cpus
> >>    domain 2: span 768-1023 level CPU
> >>     groups: 4 groups with 256 cpus per group
> >
> > Wow, that topology is horrid. I'm not surprised that your box is
> > writhing in agony. Can you twiddle that?
>
> twiddle that how?

That was the question, _do_ you have any control, because that topology
is toxic. I guess your reply means 'nope'.

> The system has 4 physical cpus (sockets). Each cpu has 32 cores with 8
> threads per core and each cpu has 4 memory controllers.

Thank god I've never met one of these, looks like the box from hell :)

> If I disable SCHED_MC and CGROUPS_SCHED (group scheduling) there is a
> noticeable improvement -- watchdog does not trigger and I do not get the
> rq locks held for 2-3 seconds. But there is still fairly high cpu usage
> for an idle system. Perhaps I should leave SCHED_MC on and disable
> SCHED_SMT; I'll try that today.

Well, if you disable SMT, your troubles _should_ shrink radically, as
your box does. You should probably look at why you have CPU domains.
You don't ever want to see that on a NUMA box.

	-Mike
Re: NMI watchdog triggering during load_balance
On 3/6/15 2:12 AM, Peter Zijlstra wrote:
> On Thu, Mar 05, 2015 at 09:05:28PM -0700, David Ahern wrote:
>> Socket(s):             32
>> NUMA node(s):          4
>
> Urgh, with 32 'cpus' per socket, you still do _8_ sockets per node, for
> a total of 256 cpus per node.

Per the response to Mike, the system has 4 physical cpus. Each cpu has
32 cores with 8 threads per core and 4 memory controllers (one mcu per
8 cores). Yes, there are 256 logical cpus per node.

David
Re: NMI watchdog triggering during load_balance
On 3/6/15 2:07 AM, Peter Zijlstra wrote:
> On Thu, Mar 05, 2015 at 09:05:28PM -0700, David Ahern wrote:
>> Since each domain is a superset of the lower one each pass through
>> load_balance regularly repeats the processing of the previous domain
>> (e.g., NODE domain repeats the cpus in the CPU domain). Then multiplying
>> that across 1024 cpus and it seems like a lot of duplication.
>
> It is, _but_ each domain has an interval, bigger domains _should_ load
> balance at a bigger interval (iow lower frequency), and all this is
> lockless data gathering, so reusing stuff from the previous round could
> be quite stale indeed.

Yes, and I have twiddled the intervals. The defaults for min_interval
and max_interval (msec):

        min_interval  max_interval
SMT          1             2
MC           1             4
CPU          1             4
NODE         8            32

Increasing those values (e.g. moving NODE to 50 and 100) drops idle time
cpu usage but does not solve the fundamental problem -- under load the
balancing of domains seems to be lining up and the system comes to a
halt in a load balancing frenzy.

David
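[For reference, those min_interval/max_interval values are consumed roughly
as follows; a simplified sketch of the rebalance_domains() logic, not the
exact source of any one kernel version:

static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
{
	int cpu = rq->cpu;
	struct sched_domain *sd;
	unsigned long interval;
	int continue_balancing = 1;

	for_each_domain(cpu, sd) {	/* SIBLING -> MC -> CPU -> NODE */
		interval = sd->balance_interval;	/* starts at min_interval (ms) */
		if (idle != CPU_IDLE)
			interval *= sd->busy_factor;	/* balance less often when busy */
		interval = clamp(msecs_to_jiffies(interval), 1UL,
				 max_load_balance_interval);

		if (time_after_eq(jiffies, sd->last_balance + interval)) {
			if (load_balance(cpu, rq, sd, idle, &continue_balancing))
				idle = CPU_NOT_IDLE;	/* pulled work */
			sd->last_balance = jiffies;
		}
	}
	/*
	 * load_balance() itself backs sd->balance_interval off toward
	 * max_interval when nothing can be moved and pulls it back toward
	 * min_interval when tasks do move, so raising NODE's limits (as
	 * tried above) directly reduces how often the huge CPU/NODE
	 * levels are scanned.
	 */
}
]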
Re: NMI watchdog triggering during load_balance
On 3/6/15 1:51 AM, Peter Zijlstra wrote:
> On Thu, Mar 05, 2015 at 09:05:28PM -0700, David Ahern wrote:
>> Hi Peter/Mike/Ingo:
>>
>> Does that make sense or am I off in the weeds?
>
> How much of your story pertains to 3.18? I'm not particularly
> interested in anything much older than that.

No. All of the data in the opening email are from 2.6.39. Each kernel
(2.6.39, 3.8 and 3.18) has a different performance problem. I will look
at 3.18 in depth soon, but from what I can see the fundamental concepts
of the load balancing have not changed (e.g., my tracepoints from 2.6.39
still apply to 3.18).

David
Re: NMI watchdog triggering during load_balance
On 3/5/15 9:52 PM, Mike Galbraith wrote:
>> CPU970 attaching sched-domain:
>>  domain 0: span 968-975 level SIBLING
>>   groups: 8 single CPU groups
>>   domain 1: span 968-975 level MC
>>    groups: 1 group with 8 cpus
>>    domain 2: span 768-1023 level CPU
>>     groups: 4 groups with 256 cpus per group
>
> Wow, that topology is horrid. I'm not surprised that your box is
> writhing in agony. Can you twiddle that?

twiddle that how?

The system has 4 physical cpus (sockets). Each cpu has 32 cores with 8
threads per core and each cpu has 4 memory controllers.

If I disable SCHED_MC and CGROUPS_SCHED (group scheduling) there is a
noticeable improvement -- watchdog does not trigger and I do not get
the rq locks held for 2-3 seconds. But there is still fairly high cpu
usage for an idle system. Perhaps I should leave SCHED_MC on and disable
SCHED_SMT; I'll try that today.

Thanks,
David
Re: NMI watchdog triggering during load_balance
On Thu, Mar 05, 2015 at 09:05:28PM -0700, David Ahern wrote:
> Socket(s):             32
> NUMA node(s):          4

Urgh, with 32 'cpus' per socket, you still do _8_ sockets per node, for
a total of 256 cpus per node.

That's painful. I don't suppose you can really change the hardware, but
that's a 'curious' choice.
Re: NMI watchdog triggering during load_balance
On Thu, Mar 05, 2015 at 09:05:28PM -0700, David Ahern wrote:
> Since each domain is a superset of the lower one each pass through
> load_balance regularly repeats the processing of the previous domain
> (e.g., NODE domain repeats the cpus in the CPU domain). Then multiplying
> that across 1024 cpus and it seems like a lot of duplication.

It is, _but_ each domain has an interval, bigger domains _should_ load
balance at a bigger interval (iow lower frequency), and all this is
lockless data gathering, so reusing stuff from the previous round could
be quite stale indeed.
Re: NMI watchdog triggering during load_balance
On Thu, Mar 05, 2015 at 09:05:28PM -0700, David Ahern wrote:
> Hi Peter/Mike/Ingo:
>
> Does that make sense or am I off in the weeds?

How much of your story pertains to 3.18? I'm not particularly interested
in anything much older than that.
Re: NMI watchdog triggering during load_balance
On Thu, 2015-03-05 at 21:05 -0700, David Ahern wrote:
> Hi Peter/Mike/Ingo:
>
> I've been banging my head against this wall for a week now and hoping
> you or someone could shed some light on the problem.
>
> On larger systems (256 to 1024 cpus) there are several use cases (e.g.,
> http://www.cs.virginia.edu/stream/) that regularly trigger the NMI
> watchdog with the stack trace:
>
> Call Trace:
> @ [0045d3d0] double_rq_lock+0x4c/0x68
> @ [004699c4] load_balance+0x278/0x740
> @ [008a7b88] __schedule+0x378/0x8e4
> @ [008a852c] schedule+0x68/0x78
> @ [0042c82c] cpu_idle+0x14c/0x18c
> @ [008a3a50] after_lock_tlb+0x1b4/0x1cc
>
> Capturing data for all CPUs I tend to see load_balance related stack
> traces on 700-800 cpus, with a few hundred blocked on
> _raw_spin_trylock_bh.
>
> I originally thought it was a deadlock in the rq locking, but if I bump
> the watchdog timeout the system eventually recovers (after 10-30+
> seconds of unresponsiveness) so it does not seem likely to be a deadlock.
>
> This particular system has 1024 cpus:
> # lscpu
> Architecture:          sparc64
> CPU op-mode(s):        32-bit, 64-bit
> Byte Order:            Big Endian
> CPU(s):                1024
> On-line CPU(s) list:   0-1023
> Thread(s) per core:    8
> Core(s) per socket:    4
> Socket(s):             32
> NUMA node(s):          4
> NUMA node0 CPU(s):     0-255
> NUMA node1 CPU(s):     256-511
> NUMA node2 CPU(s):     512-767
> NUMA node3 CPU(s):     768-1023
>
> and there are 4 scheduling domains. An example of the domain debug
> output (condensed for the email):
>
> CPU970 attaching sched-domain:
>  domain 0: span 968-975 level SIBLING
>   groups: 8 single CPU groups
>   domain 1: span 968-975 level MC
>    groups: 1 group with 8 cpus
>    domain 2: span 768-1023 level CPU
>     groups: 4 groups with 256 cpus per group

Wow, that topology is horrid. I'm not surprised that your box is
writhing in agony. Can you twiddle that?

	-Mike