Hi Peter/Mike/Ingo:

I've been banging my head against this wall for a week now and am hoping you or someone else could shed some light on the problem.

On larger systems (256 to 1024 cpus) there are several use cases (e.g., http://www.cs.virginia.edu/stream/) that regularly trigger the NMI watchdog with the stack trace:

Call Trace:
@  [000000000045d3d0] double_rq_lock+0x4c/0x68
@  [00000000004699c4] load_balance+0x278/0x740
@  [00000000008a7b88] __schedule+0x378/0x8e4
@  [00000000008a852c] schedule+0x68/0x78
@  [000000000042c82c] cpu_idle+0x14c/0x18c
@  [00000000008a3a50] after_lock_tlb+0x1b4/0x1cc

Capturing stack traces for all CPUs, I tend to see load_balance-related traces on 700-800 of them, with a few hundred more blocked on _raw_spin_trylock_bh.

I originally thought it was a deadlock in the rq locking, but if I bump the watchdog timeout the system eventually recovers (after 10-30+ seconds of unresponsiveness), so a deadlock seems unlikely.

This particular system has 1024 cpus:
# lscpu
Architecture:          sparc64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Big Endian
CPU(s):                1024
On-line CPU(s) list:   0-1023
Thread(s) per core:    8
Core(s) per socket:    4
Socket(s):             32
NUMA node(s):          4
NUMA node0 CPU(s):     0-255
NUMA node1 CPU(s):     256-511
NUMA node2 CPU(s):     512-767
NUMA node3 CPU(s):     768-1023

and there are 4 scheduling domains. An example of the domain debug output (condensed for the email):

CPU970 attaching sched-domain:
 domain 0: span 968-975 level SIBLING
  groups: 8 single CPU groups
  domain 1: span 968-975 level MC
   groups: 1 group with 8 cpus
   domain 2: span 768-1023 level CPU
    groups: 32 groups with 8 cpus per group
    domain 3: span 0-1023 level NODE
     groups: 4 groups with 256 cpus per group


Even on an idle system (20 or so non-kernel threads such as mingetty, udev, ...) perf top shows the task scheduler consuming significant time:


PerfTop: 136580 irqs/sec kernel:99.9% exact: 0.0% [1000Hz cycles], (all, 1024 CPUs)
-----------------------------------------------------------------------

    20.22%  [kernel]  [k] find_busiest_group
    16.00%  [kernel]  [k] find_next_bit
     6.37%  [kernel]  [k] ktime_get_update_offsets
     5.70%  [kernel]  [k] ktime_get
...


This is a 2.6.39 kernel (yes, a relatively old one); 3.8 shows similar symptoms. 3.18 is much better.

From what I can tell, load balancing is happening non-stop and there is heavy contention on the run queue locks. I instrumented the rq locking and, under load (e.g., the stream test), regularly see a single rq lock held continuously for 2-3 seconds (e.g., at the end of a stream run, when the process with its 1024 threads is terminating).

I have been staring at and instrumenting the scheduling code for days. The balancing of the domains regularly lines up on all or almost all CPUs, and the NODE domain seems to cause the most damage since its span covers all cpus (i.e., in rebalance_domains() each pass over that domain triggers a load_balance scan of every cpu, on every cpu, at the same time). In random snapshots during a stream test I have seen a single pass through rebalance_domains take > 17 seconds (custom tracepoints tagging the start and end).

Since each domain is a superset of the one below it, each pass through load_balance largely repeats the processing of the previous domain (e.g., the NODE domain rescans the cpus already covered by the CPU domain). Multiply that across 1024 cpus and it looks like a lot of duplication.

Does that make sense or am I off in the weeds?

Thanks,
David
--