I have been having some trouble tracing the source of a CPU stall with Open MPI on Gentoo.

My code is very simple: each process does a Monte Carlo run, saves some data to disk, and sends a single MPI_DOUBLE back to rank zero, which picks the best value from all the computations (including the one it did itself). For some reason, this can cause CPUs to "stall" (see the dmesg output below) -- and the stall actually causes the system to crash and reboot, which seems pretty crazy.

My best guess is that the ranks greater than zero have MPI_Sends outstanding while rank zero is not yet finished with its own computation, and so has not posted a matching MPI_Recv. Do they get mad waiting? This happens when I give the Monte Carlo runs large iteration counts, so the variance in finishing times is larger. However, the behavior seems a bit extreme, and I am wondering if something more subtle is going on.

My sysadmin was trying to fix something on the machine the last time it crashed, and it trashed the kernel! So I am also in the sysadmin doghouse. Any help or advice greatly appreciated! Is it likely to be an MPI_Send/MPI_Recv problem, or is there something else going on?

[ 1273.079260] INFO: rcu_sched detected stalls on CPUs/tasks: { 12 13} (detected by 17, t=60002 jiffies)
[ 1273.079272] Pid: 2626, comm: cluster Not tainted 3.6.11-gentoo #10
[ 1273.079275] Call Trace:
[ 1273.079277] <IRQ> [<ffffffff81099b87>] rcu_check_callbacks+0x5a7/0x600
[ 1273.079294] [<ffffffff8103fae3>] update_process_times+0x43/0x80
[ 1273.079298] [<ffffffff8106d796>] tick_sched_timer+0x76/0xc0
[ 1273.079303] [<ffffffff8105329e>] __run_hrtimer.isra.33+0x4e/0x100
[ 1273.079306] [<ffffffff81053adb>] hrtimer_interrupt+0xeb/0x220
[ 1273.079311] [<ffffffff8101fd94>] smp_apic_timer_interrupt+0x64/0xa0
[ 1273.079316] [<ffffffff81515f07>] apic_timer_interrupt+0x67/0x70
[ 1273.079317] <EOI>

Simon
Research Fellow
Santa Fe Institute
http://santafe.edu/~simon
My code is very simple: each process does a Monte Carlo run, saves some data to disk, and sends back a single MPI_DOUBLE to node zero, which picks the best value from all the computations (including the one it did itself). For some reason, this can cause CPUs to "stall" (see the error below, on dmesg output) -- this stall actually causes the system to crash and reboot, which seems pretty crazy. My best guess is that some of the nodes greater than zero have "MPI_Send"s out, but node zero is not finished with its own computation yet, and so has not put out an MPI_Recv. They get mad waiting? This happens when I give the Monte Carlo runs large numbers, and so the variance in end time is larger. However, the behavior seems a bit extreme, and I am wondering if something more subtle is going on. My sysadmin was trying to fix something on the machine the last time it crashed, and it trashed the kernel! So I am also in the sysadmin doghouse. Any help or advice greatly appreciated! Is it likely to be an MPI_Send/MPI_Recv problem, or is there something else going on? [ 1273.079260] INFO: rcu_sched detected stalls on CPUs/tasks: { 12 13} (detected by 17, t=60002 jiffies) [ 1273.079272] Pid: 2626, comm: cluster Not tainted 3.6.11-gentoo #10 [ 1273.079275] Call Trace: [ 1273.079277] <IRQ> [<ffffffff81099b87>] rcu_check_callbacks+0x5a7/0x600 [ 1273.079294] [<ffffffff8103fae3>] update_process_times+0x43/0x80 [ 1273.079298] [<ffffffff8106d796>] tick_sched_timer+0x76/0xc0 [ 1273.079303] [<ffffffff8105329e>] __run_hrtimer.isra.33+0x4e/0x100 [ 1273.079306] [<ffffffff81053adb>] hrtimer_interrupt+0xeb/0x220 [ 1273.079311] [<ffffffff8101fd94>] smp_apic_timer_interrupt+0x64/0xa0 [ 1273.079316] [<ffffffff81515f07>] apic_timer_interrupt+0x67/0x70 [ 1273.079317] <EOI> Simon Research Fellow Santa Fe Institute http://santafe.edu/~simon