Ed,

I'm not sure, but there might be a case where the BTL is getting overwhelmed by the non-blocking operations while it is still trying to set up the connections. There is a simple test for this: add an MPI_Alltoall with a reasonable size (100k) before you start posting the non-blocking receives, and let's see if this solves your issue.
George.

On Jun 26, 2013, at 04:02, eblo...@1scom.net wrote:

> An update: I recoded the mpi_waitall as a loop over the requests with
> mpi_test and a 30 second timeout. The timeout happens unpredictably,
> sometimes after 10 minutes of run time, other times after 15 minutes, for
> the exact same case.
>
> After 30 seconds, I print out the status of all outstanding receive
> requests. The message tags that are outstanding have definitely been
> sent, so I am wondering why they are not getting received?
>
> As I said before, everybody posts non-blocking standard receives, then
> non-blocking standard sends, then calls mpi_waitall. Each process is
> typically waiting on 200 to 300 requests. Is deadlock possible via this
> implementation approach under some kind of unusual conditions?
>
> Thanks again,
>
> Ed
>
>> I'm running OpenMPI 1.6.4 and seeing a problem where mpi_waitall never
>> returns. The case runs fine with MVAPICH. The logic associated with the
>> communications has been extensively debugged in the past; we don't think
>> it has errors. Each process posts non-blocking receives, non-blocking
>> sends, and then does waitall on all the outstanding requests.
>>
>> The work is broken down into 960 chunks. If I run with 960 processes (60
>> nodes of 16 cores each), things seem to work. If I use 160 processes
>> (each process handling 6 chunks of work), then each process is handling 6
>> times as much communication, and that is the case that hangs with OpenMPI
>> 1.6.4; again, seems to work with MVAPICH. Is there an obvious place to
>> start, diagnostically? We're using the openib btl.
>>
>> Thanks,
>>
>> Ed
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users