Ed, how large are the messages that you are sending and receiving?

Rolf

From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Ed Blosch
Sent: Thursday, June 27, 2013 9:01 AM
To: us...@open-mpi.org
Subject: Re: [OMPI users] Application hangs on mpi_waitall

It ran a bit longer but still deadlocked. All matching sends are posted 1:1 with posted recvs, so it is a delivery issue of some kind. I'm running a debug-compiled version tonight to see what that might turn up. I may try to rewrite with blocking sends and see if that works. I can also try adding a barrier (irecvs, barrier, isends, waitall) to make sure sends are not buffering while waiting for recvs to be posted.

-------- Original message --------
From: George Bosilca <bosi...@icl.utk.edu>
Date:
To: Open MPI Users <us...@open-mpi.org>
Subject: Re: [OMPI users] Application hangs on mpi_waitall

Ed,

I'm not sure, but there might be a case where the BTL is getting overwhelmed by the non-blocking operations while trying to set up the connection. There is a simple test for this. Add an MPI_Alltoall with a reasonable size (100k) before you start posting the non-blocking receives, and let's see if this solves your issue.

George.

On Jun 26, 2013, at 04:02, eblo...@1scom.net wrote:

> An update: I recoded the mpi_waitall as a loop over the requests with
> mpi_test and a 30 second timeout. The timeout happens unpredictably,
> sometimes after 10 minutes of run time, other times after 15 minutes, for
> the exact same case.
>
> After 30 seconds, I print out the status of all outstanding receive
> requests. The message tags that are outstanding have definitely been
> sent, so I am wondering why they are not getting received?
>
> As I said before, everybody posts non-blocking standard receives, then
> non-blocking standard sends, then calls mpi_waitall. Each process is
> typically waiting on 200 to 300 requests. Is deadlock possible via this
> implementation approach under some kind of unusual conditions?
>
> Thanks again,
>
> Ed
>
>> I'm running OpenMPI 1.6.4 and seeing a problem where mpi_waitall never
>> returns. The case runs fine with MVAPICH. The logic associated with the
>> communications has been extensively debugged in the past; we don't think
>> it has errors. Each process posts non-blocking receives, non-blocking
>> sends, and then does waitall on all the outstanding requests.
>>
>> The work is broken down into 960 chunks. If I run with 960 processes (60
>> nodes of 16 cores each), things seem to work. If I use 160 processes
>> (each process handling 6 chunks of work), then each process is handling 6
>> times as much communication, and that is the case that hangs with OpenMPI
>> 1.6.4; again, seems to work with MVAPICH. Is there an obvious place to
>> start, diagnostically? We're using the openib btl.
>>
>> Thanks,
>>
>> Ed
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users