I have been having some trouble tracing the source of a CPU stall with Open MPI 
on Gentoo.

My code is very simple: each process does a Monte Carlo run, saves some data to 
disk, and sends back a single MPI_DOUBLE to node zero, which picks the best 
value from all the computations (including the one it did itself).
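In rough outline, the MPI part looks something like this (a simplified sketch 
only -- the real code also writes its results to disk, and the dummy 
monte_carlo_run below just stands in for the actual computation):

#include <mpi.h>
#include <stdio.h>

/* Stand-in for the real Monte Carlo computation; runtimes vary by node. */
static double monte_carlo_run(long n_samples)
{
    double best = 0.0;
    for (long i = 0; i < n_samples; i++)
        best += 1.0 / (double)(i + 1);   /* dummy work */
    return best;
}

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local_best = monte_carlo_run(100000000L);
    /* ... results would also be saved to disk here ... */

    if (rank == 0) {
        /* Node zero collects one double from every other node and keeps
           the best value, including its own. */
        double best = local_best;
        for (int src = 1; src < size; src++) {
            double val;
            MPI_Recv(&val, 1, MPI_DOUBLE, src, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            if (val > best)
                best = val;
        }
        printf("best value: %g\n", best);
    } else {
        /* Every other node sends its single result back to node zero. */
        MPI_Send(&local_best, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}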

For some reason, this can cause CPUs to "stall" (see the dmesg output below) -- 
this stall actually causes the system to crash and reboot, which seems pretty 
crazy.

My best guess is that some of the nodes greater than zero have MPI_Sends 
outstanding, but node zero is not finished with its own computation yet, and so 
has not posted an MPI_Recv. Do they just get stuck waiting? This happens when I 
give the Monte Carlo runs large sample counts, so the variance in finish time 
across nodes is larger.

However, the behavior seems a bit extreme, and I am wondering if something more 
subtle is going on. My sysadmin was trying to fix something on the machine the 
last time it crashed, and it trashed the kernel! So I am also in the sysadmin 
doghouse.

Any help or advice greatly appreciated! Is it likely to be an MPI_Send/MPI_Recv 
problem, or is there something else going on?

[ 1273.079260] INFO: rcu_sched detected stalls on CPUs/tasks: { 12 13} 
(detected by 17, t=60002 jiffies)
[ 1273.079272] Pid: 2626, comm: cluster Not tainted 3.6.11-gentoo #10
[ 1273.079275] Call Trace:
[ 1273.079277]  <IRQ>  [<ffffffff81099b87>] rcu_check_callbacks+0x5a7/0x600
[ 1273.079294]  [<ffffffff8103fae3>] update_process_times+0x43/0x80
[ 1273.079298]  [<ffffffff8106d796>] tick_sched_timer+0x76/0xc0
[ 1273.079303]  [<ffffffff8105329e>] __run_hrtimer.isra.33+0x4e/0x100
[ 1273.079306]  [<ffffffff81053adb>] hrtimer_interrupt+0xeb/0x220
[ 1273.079311]  [<ffffffff8101fd94>] smp_apic_timer_interrupt+0x64/0xa0
[ 1273.079316]  [<ffffffff81515f07>] apic_timer_interrupt+0x67/0x70
[ 1273.079317]  <EOI>

Simon

Research Fellow
Santa Fe Institute
http://santafe.edu/~simon


