At this point I'm running out of ideas … Can I have a simple reproducer of this issue? If possible, send me the code and I'll try to dig a little more to see what the problem is.
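For reference, a minimal reproducer of the pattern Ed describes below (every rank posts all of its non-blocking receives, then all of its non-blocking sends, then a single waitall on everything) might look roughly like the C sketch that follows. The ring neighbor pattern, the 250-message count, and the message size are placeholder assumptions, not values taken from the real application.

    /*
     * Hypothetical minimal reproducer of the pattern described in this thread:
     * each rank posts all non-blocking receives, then all non-blocking sends,
     * then calls MPI_Waitall on everything.  Neighbors, message count, and
     * message size are made-up placeholders, not values from the real code.
     */
    #include <mpi.h>
    #include <stdlib.h>

    #define NMSG   250        /* ~200-300 outstanding requests per rank, per the thread */
    #define COUNT  4096       /* doubles per message (placeholder) */

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double *rbuf = calloc((size_t)NMSG * COUNT, sizeof(double));
        double *sbuf = calloc((size_t)NMSG * COUNT, sizeof(double));
        MPI_Request req[2 * NMSG];

        int left  = (rank - 1 + size) % size;
        int right = (rank + 1) % size;

        /* Post all receives first ... */
        for (int i = 0; i < NMSG; i++)
            MPI_Irecv(rbuf + i * COUNT, COUNT, MPI_DOUBLE, left, i,
                      MPI_COMM_WORLD, &req[i]);

        /* ... then all sends ... */
        for (int i = 0; i < NMSG; i++)
            MPI_Isend(sbuf + i * COUNT, COUNT, MPI_DOUBLE, right, i,
                      MPI_COMM_WORLD, &req[NMSG + i]);

        /* ... then wait on everything at once. */
        MPI_Waitall(2 * NMSG, req, MPI_STATUSES_IGNORE);

        free(rbuf);
        free(sbuf);
        MPI_Finalize();
        return 0;
    }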
George.

On Jun 27, 2013, at 23:02 , "Blosch, Edwin L" <edwin.l.blo...@lmco.com> wrote:

> I tried excluding openib, but it did not succeed. The run actually made about
> the same progress as it previously did using the openib interface before
> hanging (I mean, my 30-second timeout period expired).
>
> I'm more than happy to try out any other suggestions…
>
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of George Bosilca
> Sent: Thursday, June 27, 2013 2:57 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] EXTERNAL: Re: Application hangs on mpi_waitall
>
> This seems to highlight a possible bug in the MPI implementation. As I
> suggested earlier, the credit management of the OpenIB BTL might be unsafe.
>
> To confirm this, there is one last test to run. Let's prevent the OpenIB
> support from being used during the run (thus Open MPI will fall back to TCP).
> I suppose you have Ethernet cards in your cluster, or you have IPoIB. Add
> "--mca btl ^openib" to your mpirun command. If this allows your application
> to run to completion, then we know exactly where to start looking.
>
> George.
>
> On Jun 27, 2013, at 19:59 , "Blosch, Edwin L" <edwin.l.blo...@lmco.com> wrote:
>
> The debug version also hung, with roughly the same amount of progress in the
> computations (although of course it took much longer to make that progress in
> comparison to the optimized version).
>
> On the bright side, the idea of putting an mpi_barrier after the irecvs and
> before the isends appears to have helped. I was able to run 5 times farther
> without any trouble. So now I'm trying to run 50 times farther and, if there
> is no hang, I will declare workaround victory.
>
> What could this mean?
>
> I am guessing that one or more processes may run ahead of the others, just
> because of the different amounts of work that precede the communication step.
> If a process manages to post all its irecvs and all its isends well before
> another process has managed to post any matching irecvs, perhaps there is
> some buffering resource on the sender side that is getting exhausted? This is
> pure guessing on my part.
>
> Thanks
>
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Ed Blosch
> Sent: Thursday, June 27, 2013 8:01 AM
> To: us...@open-mpi.org
> Subject: EXTERNAL: Re: [OMPI users] Application hangs on mpi_waitall
>
> It ran a bit longer but still deadlocked. All matching sends are posted 1:1
> with posted recvs, so it is a delivery issue of some kind. I'm running a
> debug-compiled version tonight to see what that might turn up. I may try to
> rewrite with blocking sends and see if that works. I can also try adding a
> barrier (irecvs, barrier, isends, waitall) to make sure sends are not
> buffering while waiting for recvs to be posted.
>
> -------- Original message --------
> From: George Bosilca <bosi...@icl.utk.edu>
> Date:
> To: Open MPI Users <us...@open-mpi.org>
> Subject: Re: [OMPI users] Application hangs on mpi_waitall
>
> Ed,
>
> I'm not sure, but there might be a case where the BTL is getting overwhelmed
> by the non-blocking operations while trying to set up the connections. There
> is a simple test for this. Add an MPI_Alltoall with a reasonable size (100k)
> before you start posting the non-blocking receives, and let's see if this
> solves your issue.
>
> George.
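A sketch of the warm-up test George suggests above: issue one MPI_Alltoall before the application posts its first round of non-blocking receives, so that every pair of ranks has already established a connection. The helper name and the 100 KB-per-peer buffer size below are assumptions (one reading of the "100k" in the message), not values confirmed anywhere in the thread.

    /* Hypothetical connection warm-up, per the MPI_Alltoall suggestion above.
     * The 100 KB-per-peer size is an assumed interpretation of "100k". */
    #include <mpi.h>
    #include <stdlib.h>

    static void warmup_alltoall(MPI_Comm comm)
    {
        int size;
        MPI_Comm_size(comm, &size);

        const int per_peer = 100 * 1024;            /* bytes exchanged with each rank */
        char *sbuf = calloc((size_t)size, per_peer);
        char *rbuf = calloc((size_t)size, per_peer);

        /* Forces every pair of ranks to exchange data, so all openib/TCP
         * connections exist before the irecv/isend/waitall phase begins. */
        MPI_Alltoall(sbuf, per_peer, MPI_BYTE, rbuf, per_peer, MPI_BYTE, comm);

        free(sbuf);
        free(rbuf);
    }

Calling something like warmup_alltoall(MPI_COMM_WORLD) once, right after MPI_Init, would be enough for the test; the data exchanged is irrelevant, only the connection setup matters.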
> On Jun 26, 2013, at 04:02 , eblo...@1scom.net wrote:
>
> > An update: I recoded the mpi_waitall as a loop over the requests with
> > mpi_test and a 30-second timeout. The timeout happens unpredictably,
> > sometimes after 10 minutes of run time, other times after 15 minutes, for
> > the exact same case.
> >
> > After 30 seconds, I print out the status of all outstanding receive
> > requests. The message tags that are outstanding have definitely been sent,
> > so I am wondering why they are not getting received.
> >
> > As I said before, everybody posts non-blocking standard receives, then
> > non-blocking standard sends, then calls mpi_waitall. Each process is
> > typically waiting on 200 to 300 requests. Is deadlock possible via this
> > implementation approach under some kind of unusual conditions?
> >
> > Thanks again,
> >
> > Ed
> >
> >> I'm running OpenMPI 1.6.4 and seeing a problem where mpi_waitall never
> >> returns. The case runs fine with MVAPICH. The logic associated with the
> >> communications has been extensively debugged in the past; we don't think
> >> it has errors. Each process posts non-blocking receives, non-blocking
> >> sends, and then does a waitall on all the outstanding requests.
> >>
> >> The work is broken down into 960 chunks. If I run with 960 processes (60
> >> nodes of 16 cores each), things seem to work. If I use 160 processes
> >> (each process handling 6 chunks of work), then each process is handling
> >> 6 times as much communication, and that is the case that hangs with
> >> OpenMPI 1.6.4; again, it seems to work with MVAPICH. Is there an obvious
> >> place to start, diagnostically? We're using the openib btl.
> >>
> >> Thanks,
> >>
> >> Ed
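For reference, the diagnostic Ed describes at the top of the June 26 message above (replacing mpi_waitall with a loop over mpi_test plus a 30-second timeout, then reporting whatever is still outstanding) might look roughly like the following C sketch. The function name, return convention, and reporting are illustrative only, not taken from the actual application.

    /* Hypothetical sketch of the test-with-timeout loop described above.
     * Returns 0 if all requests completed, -1 on timeout. */
    #include <mpi.h>
    #include <stdio.h>

    static int waitall_with_timeout(int n, MPI_Request *req, double timeout_sec)
    {
        double start = MPI_Wtime();
        int remaining = n;

        while (remaining > 0) {
            remaining = 0;
            for (int i = 0; i < n; i++) {
                if (req[i] == MPI_REQUEST_NULL)
                    continue;                 /* already completed earlier */
                int done = 0;
                MPI_Test(&req[i], &done, MPI_STATUS_IGNORE);
                if (!done)
                    remaining++;
            }
            if (remaining > 0 && MPI_Wtime() - start > timeout_sec) {
                for (int i = 0; i < n; i++)
                    if (req[i] != MPI_REQUEST_NULL)
                        fprintf(stderr, "request %d still outstanding after %.0f s\n",
                                i, timeout_sec);
                return -1;
            }
        }
        return 0;
    }

Called in place of mpi_waitall, e.g. waitall_with_timeout(nreq, requests, 30.0), the caller can then print the tag and source of each outstanding receive, as Ed does, before deciding whether to abort or keep waiting.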