Jonathan Dursi wrote:
Continuing the conversation with myself:
Sorry to interrupt... :^)
Okay, I managed to reproduce the hang. I'll try to look at this.
Google pointed me to Trac ticket #1944, which spoke of deadlocks in
looped collective operations. There is no collective operation
anywhere in this sample code, but trying one of the suggested
workarounds/clues, namely setting btl_sm_num_fifos to at least
(np-1), seems to make things work quite reliably for both OpenMPI
1.3.2 and 1.3.3. That is, while this
mpirun -np 6 -mca btl sm,self ./diffusion-mpi
invariably hangs (after a random-seeming number of iterations) with
OpenMPI 1.3.2, and sometimes hangs (maybe 10% of the time, again
seemingly at random) with 1.3.3,
mpirun -np 6 -mca btl tcp,self ./diffusion-mpi
or
mpirun -np 6 -mca btl_sm_num_fifos 5 -mca btl sm,self ./diffusion-mpi
always succeeds, with (as one might guess) the second being much
faster...
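For anyone trying to reproduce this without the original source: the failing pattern is just repeated nearest-neighbour point-to-point exchange over the sm BTL. Here is a minimal sketch of that pattern; this is not the actual diffusion-mpi code, and the buffer sizes, iteration count, and neighbour logic are my assumptions:

```c
/* Minimal sketch of the kind of loop that hangs: repeated
 * nearest-neighbour MPI_Sendrecv in a 1-D decomposition.
 * NOT the actual diffusion-mpi source; details are made up. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Edge ranks talk to MPI_PROC_NULL, so no special-casing needed. */
    int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    double sendbuf = (double)rank, recvbuf = 0.0;

    /* Many iterations: the hang shows up at a random-seeming step. */
    for (int it = 0; it < 100000; it++) {
        MPI_Sendrecv(&sendbuf, 1, MPI_DOUBLE, left,  0,
                     &recvbuf, 1, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&sendbuf, 1, MPI_DOUBLE, right, 1,
                     &recvbuf, 1, MPI_DOUBLE, left,  1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    if (rank == 0) printf("done\n");
    MPI_Finalize();
    return 0;
}
```

Build with mpicc and run it the same three ways as above (sm,self; tcp,self; sm,self plus -mca btl_sm_num_fifos 5) to compare behaviour.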
Jonathan