I had an application suddenly stop making progress.  By killing the last of 
the 208 processes and then looking at the stack traces, I found that 3 of the 
208 processes were in an MPI_REDUCE call.  The other 205 had progressed in 
their execution to another routine, where they were waiting in an unrelated 
MPI_ALLREDUCE call.

The code structure is such that each process calls MPI_REDUCE 5 times for 
different variables, then some work is done, and then the MPI_ALLREDUCE call 
happens early in the next iteration of the solution procedure.  It also seems 
noteworthy that the 3 processes stuck at MPI_REDUCE were actually stuck on 
the 4th of the 5 MPI_REDUCE calls, not the 5th.

There were no issues with MVAPICH.  The problem was easily worked around by 
adding an MPI_BARRIER after the section of MPI_REDUCE calls.
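
Concretely, the workaround is just this, placed right after the loop of 
MPI_Reduce calls in the sketch above:

        /* workaround: keep any rank from running ahead into the next
           iteration's MPI_ALLREDUCE before every rank has passed the
           MPI_REDUCE section */
        MPI_Barrier(MPI_COMM_WORLD);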

It seems as if MPI_REDUCE has some kind of non-blocking implementation, and 
that it was not safe to enter the MPI_ALLREDUCE while those MPI_REDUCE calls 
had not yet completed on other processes.
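
If it would help isolate where this is happening, I can also rerun with the 
tuned collective component disabled (something like "mpirun --mca coll 
^tuned ...") to see whether the hang goes away with the basic algorithms; I 
am assuming that is still the right knob in the 1.8 series.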

This was with Open MPI 1.8.1.  The same problem was seen on 3 slightly 
different systems, all QDR InfiniBand with Mellanox HCAs, using a Mellanox 
OFED stack (slightly different versions on each cluster) and Intel compilers 
(again, slightly different versions on each of the 3 systems).

Has anyone encountered anything similar?  While I have a workaround, I want to 
make sure the root cause of the deadlock gets fixed.  Please let me know what I 
can do to help.

Thanks,

Ed
