I'm glad you figured this out.  Your mail was on my to-do list to reply to 
today; I didn't reply earlier simply because I had no idea what the problem 
could have been.  

I'm also kinda glad it wasn't related to MPI.  ;-)


On Feb 27, 2013, at 11:20 AM, Simon DeDeo <simon.de...@gmail.com> wrote:

> We've resolved this issue, which appears to have been an early warning of a 
> large-scale hardware failure. Twelve hours later, the machine was unable to 
> power on or self-test. 
> 
> We are now running on a new machine, and the same jobs are finishing normally 
> -- relying solely on blocking communication, and without having to worry about 
> Send/Ssend/Isend buffering differences.
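> 
> For reference, the buffering differences I mean come down to when each send 
> variant is allowed to complete; a rough sketch with placeholder arguments 
> (val is just a stand-in):
> 
>   /* MPI_Send: may return once the message is buffered, before it is received */
>   MPI_Send (&val, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
> 
>   /* MPI_Ssend: does not complete until the matching receive has been posted */
>   MPI_Ssend(&val, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
> 
>   /* MPI_Isend: returns immediately; completion is checked later with MPI_Wait */
>   MPI_Request req;
>   MPI_Isend(&val, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);
>   MPI_Wait(&req, MPI_STATUS_IGNORE);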
> 
> Simon
> 
> Research Fellow
> Santa Fe Institute
> http://santafe.edu/~simon
> 
> On 25 Feb 2013, at 4:04 PM, Simon DeDeo <simon.de...@gmail.com> wrote:
> 
>> I have been having some trouble tracing the source of a CPU stall with Open 
>> MPI on Gentoo.
>> 
>> My code is very simple: each process does a Monte Carlo run, saves some data 
>> to disk, and sends back a single MPI_DOUBLE to node zero, which picks the 
>> best value from all the computations (including the one it did itself).
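>> 
>> Roughly, the communication pattern looks like this (a simplified sketch rather 
>> than the actual code; run_monte_carlo() is a stand-in for the real computation, 
>> and the disk I/O is omitted):
>> 
>>   #include <mpi.h>
>>   #include <stdio.h>
>> 
>>   /* stand-in for the real Monte Carlo run (which also saves data to disk) */
>>   double run_monte_carlo(int rank) { return (double) rank; }
>> 
>>   int main(int argc, char **argv) {
>>       int rank, size;
>>       MPI_Init(&argc, &argv);
>>       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>       MPI_Comm_size(MPI_COMM_WORLD, &size);
>> 
>>       double my_value = run_monte_carlo(rank);
>> 
>>       if (rank != 0) {
>>           /* each non-zero rank sends its single result back to rank 0 */
>>           MPI_Send(&my_value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
>>       } else {
>>           /* rank 0 collects the other values and keeps the best one */
>>           double best = my_value;
>>           for (int src = 1; src < size; src++) {
>>               double v;
>>               MPI_Recv(&v, 1, MPI_DOUBLE, src, 0, MPI_COMM_WORLD,
>>                        MPI_STATUS_IGNORE);
>>               if (v > best) best = v;
>>           }
>>           printf("best value: %g\n", best);
>>       }
>> 
>>       MPI_Finalize();
>>       return 0;
>>   }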
>> 
>> For some reason, this can cause CPUs to "stall" (see the dmesg output below) 
>> -- and the stall actually causes the system to crash and reboot, which seems 
>> pretty crazy.
>> 
>> My best guess is that some of the ranks greater than zero have MPI_Sends 
>> outstanding, but node zero is not finished with its own computation yet, and 
>> so has not posted a matching MPI_Recv. Do they get mad waiting? This happens 
>> when I give the Monte Carlo runs large numbers of iterations, and so the 
>> variance in finishing times is larger.
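>> 
>> If that is what's happening, I suppose the "pick the best" step could also be 
>> written as a single collective, which avoids the question of when node zero 
>> posts its receives entirely; something like the line below, assuming "best" 
>> means the maximum and my_value is each rank's result as in the sketch above:
>> 
>>   double best;
>>   /* every rank contributes my_value; rank 0 ends up with the maximum */
>>   MPI_Reduce(&my_value, &best, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);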
>> 
>> However, the behavior seems a bit extreme, and I am wondering if something 
>> more subtle is going on. My sysadmin was trying to fix something on the 
>> machine the last time it crashed, and the crash trashed the kernel! So I am 
>> also in the sysadmin doghouse.
>> 
>> Any help or advice greatly appreciated! Is it likely to be an 
>> MPI_Send/MPI_Recv problem, or is there something else going on?
>> 
>> [ 1273.079260] INFO: rcu_sched detected stalls on CPUs/tasks: { 12 13} 
>> (detected by 17, t=60002 jiffies)
>> [ 1273.079272] Pid: 2626, comm: cluster Not tainted 3.6.11-gentoo #10
>> [ 1273.079275] Call Trace:
>> [ 1273.079277]  <IRQ>  [<ffffffff81099b87>] rcu_check_callbacks+0x5a7/0x600
>> [ 1273.079294]  [<ffffffff8103fae3>] update_process_times+0x43/0x80
>> [ 1273.079298]  [<ffffffff8106d796>] tick_sched_timer+0x76/0xc0
>> [ 1273.079303]  [<ffffffff8105329e>] __run_hrtimer.isra.33+0x4e/0x100
>> [ 1273.079306]  [<ffffffff81053adb>] hrtimer_interrupt+0xeb/0x220
>> [ 1273.079311]  [<ffffffff8101fd94>] smp_apic_timer_interrupt+0x64/0xa0
>> [ 1273.079316]  [<ffffffff81515f07>] apic_timer_interrupt+0x67/0x70
>> [ 1273.079317]  <EOI>
>> 
>> Simon
>> 
>> Research Fellow
>> Santa Fe Institute
>> http://santafe.edu/~simon
>> 
>> 
> 
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/

