I'm running OpenMPI 1.6.4 and seeing a problem where mpi_waitall never returns.
The same case runs fine with MVAPICH. The communication logic has been
extensively debugged in the past, and we don't think it has errors. Each process
posts its non-blocking receives, then its non-blocking sends, and then calls
mpi_waitall on all of the outstanding requests.
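
One way to sanity-check that pattern offline is sketched below. It assumes each
rank can log the (source, destination, tag) of every request it posts; that log
format and the function name `unmatched_requests` are hypothetical and not part
of OpenMPI or MVAPICH. The sketch also ignores wildcard receives (any source or
any tag), so it only applies if every receive names its source and tag.

```python
from collections import Counter


def unmatched_requests(sends, recvs):
    """Return (sends with no matching receive, receives with no matching send).

    Each entry is a (source_rank, dest_rank, tag) tuple, as written by a
    hypothetical tracing wrapper around the non-blocking calls.
    """
    send_counts = Counter(sends)
    recv_counts = Counter(recvs)
    # Counter subtraction keeps only the positive remainders.
    extra_sends = send_counts - recv_counts
    extra_recvs = recv_counts - send_counts
    return sorted(extra_sends.elements()), sorted(extra_recvs.elements())


# Example: rank 0 sends twice to rank 1 with tag 7, but rank 1 posts one receive.
sends = [(0, 1, 7), (0, 1, 7), (1, 0, 7)]
recvs = [(0, 1, 7), (1, 0, 7)]
missing = unmatched_requests(sends, recvs)
print(missing)  # ([(0, 1, 7)], [])
```

If both lists come back empty for the hanging run, every posted send has a
posted receive, which would point away from the application logic and toward
the transport or its resource limits.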

The work is broken down into 960 chunks. If I run with 960 processes (60 nodes
of 16 cores each), things seem to work. If I use 160 processes (each process
handling 6 chunks of work), then each process handles 6 times as much
communication, and that is the case that hangs with OpenMPI 1.6.4; again, it
seems to work with MVAPICH. Is there an obvious place to start diagnostically?
We're using the openib btl.
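
The arithmetic behind the two configurations can be sketched as follows. The
contiguous block mapping of chunks to ranks is an assumption made only for
illustration (the actual mapping is not described here), and `chunks_for_rank`
is a hypothetical helper name.

```python
def chunks_for_rank(rank, n_chunks, n_ranks):
    """Chunks owned by `rank` under an assumed contiguous block distribution."""
    if n_chunks % n_ranks != 0:
        raise ValueError("this sketch assumes the chunks divide evenly")
    per_rank = n_chunks // n_ranks
    start = rank * per_rank
    return list(range(start, start + per_rank))


print(len(chunks_for_rank(0, 960, 960)))  # 1 chunk per process (working case)
print(len(chunks_for_rank(0, 960, 160)))  # 6 chunks per process (hanging case)
print(chunks_for_rank(159, 960, 160))     # [954, 955, 956, 957, 958, 959]
```

With 6 chunks per process, each mpi_waitall covers roughly 6 times as many
outstanding requests as in the 960-process run, assuming every chunk generates
the same number of messages.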

Thanks,

Ed
