I had an application suddenly stop making progress. By killing the last of the 208 processes and then looking at the stack traces, I found that 3 of the 208 processes were in an MPI_REDUCE call. The other 205 had progressed to a later routine, where they were waiting in an unrelated MPI_ALLREDUCE call.
The code structure is such that each process calls MPI_REDUCE 5 times for different variables, then some work is done, and then the MPI_ALLREDUCE call happens early in the next iteration of the solution procedure. It also seems noteworthy that the 3 processes stuck at MPI_REDUCE were actually stuck on the 4th of the 5 MPI_REDUCE calls, not the 5th.

There are no issues with MVAPICH, and the problem was easily worked around by adding an MPI_BARRIER after the section of MPI_REDUCE calls. It seems as though MPI_REDUCE has some kind of non-blocking implementation, and it was not safe to enter the MPI_ALLREDUCE while those MPI_REDUCE calls had not yet completed on other processes.

This was with Open MPI 1.8.1. I saw the same problem on 3 slightly different systems, all with QDR InfiniBand and Mellanox HCAs, using a Mellanox OFED stack (slightly different versions on each cluster) and Intel compilers (again, slightly different versions on each of the 3 systems).

Has anyone encountered anything similar? While I have a workaround, I want to make sure the root cause of the deadlock gets fixed. Please let me know what I can do to help.

Thanks,
Ed
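P.S. In case it helps, here is a minimal C sketch of the call pattern and the barrier workaround. The variable names, datatypes, reduction operations, and placeholder work are illustrative assumptions on my part, not the actual application code; only the overall shape (5 reductions, intervening work, then an all-reduce, with the barrier inserted after the reduce section) matches what I described above.

    #include <mpi.h>

    /* Sketch of one solution iteration; names and values are
     * illustrative, not the real application. */
    static void solver_iteration(MPI_Comm comm)
    {
        double vals[5] = {1, 2, 3, 4, 5}, sums[5];
        double residual = 0.0;

        /* Five consecutive reductions to rank 0, one per variable. */
        for (int i = 0; i < 5; i++)
            MPI_Reduce(&vals[i], &sums[i], 1, MPI_DOUBLE, MPI_SUM, 0, comm);

        /* Workaround: without this barrier, 3 ranks remained inside the
         * 4th MPI_Reduce while the rest entered the MPI_Allreduce below,
         * and the run deadlocked (observed under Open MPI 1.8.1). */
        MPI_Barrier(comm);

        /* ... work done between the reduce section and the next
         * iteration would go here ... */

        /* Early in the next iteration: an unrelated all-reduce. */
        MPI_Allreduce(MPI_IN_PLACE, &residual, 1, MPI_DOUBLE, MPI_MAX, comm);
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        solver_iteration(MPI_COMM_WORLD);
        MPI_Finalize();
        return 0;
    }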