At this point I'm running out of ideas … Can I have a simple reproducer of this issue? If possible, send me the code and I'll try to dig a little more to see what the problem is.
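For reference, a minimal reproducer of the pattern Ed describes below (every rank posts all of its non-blocking receives, then all of its non-blocking sends, then a single waitall on everything) might look roughly like the C sketch that follows. The ring neighbor pattern, the 250-message count, and the message size are placeholder assumptions, not values taken from the real application.

    /*
     * Hypothetical minimal reproducer of the pattern described in this thread:
     * each rank posts all non-blocking receives, then all non-blocking sends,
     * then calls MPI_Waitall on everything.  Neighbors, message count, and
     * message size are made-up placeholders, not values from the real code.
     */
    #include <mpi.h>
    #include <stdlib.h>

    #define NMSG   250        /* ~200-300 outstanding requests per rank, per the thread */
    #define COUNT  4096       /* doubles per message (placeholder) */

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double *rbuf = calloc((size_t)NMSG * COUNT, sizeof(double));
        double *sbuf = calloc((size_t)NMSG * COUNT, sizeof(double));
        MPI_Request req[2 * NMSG];

        int left  = (rank - 1 + size) % size;
        int right = (rank + 1) % size;

        /* Post all receives first ... */
        for (int i = 0; i < NMSG; i++)
            MPI_Irecv(rbuf + i * COUNT, COUNT, MPI_DOUBLE, left, i,
                      MPI_COMM_WORLD, &req[i]);

        /* ... then all sends ... */
        for (int i = 0; i < NMSG; i++)
            MPI_Isend(sbuf + i * COUNT, COUNT, MPI_DOUBLE, right, i,
                      MPI_COMM_WORLD, &req[NMSG + i]);

        /* ... then wait on everything at once. */
        MPI_Waitall(2 * NMSG, req, MPI_STATUSES_IGNORE);

        free(rbuf);
        free(sbuf);
        MPI_Finalize();
        return 0;
    }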
George.

On Jun 27, 2013, at 23:02 , "Blosch, Edwin L" <edwin.l.blo...@lmco.com> wrote:

> I tried excluding openib, but it did not succeed. The run actually made about
> the same progress as it previously did using the openib interface before
> hanging (I mean, my 30-second timeout period expired).
>
> I'm more than happy to try out any other suggestions…
>
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of George Bosilca
> Sent: Thursday, June 27, 2013 2:57 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] EXTERNAL: Re: Application hangs on mpi_waitall
>
> This seems to highlight a possible bug in the MPI implementation. As I
> suggested earlier, the credit management of the OpenIB BTL might be unsafe.
>
> To confirm this, there is one last test to run. Let's prevent the OpenIB
> support from being used during the run (thus Open MPI will fall back to TCP).
> I suppose you have Ethernet cards in your cluster, or you have IPoIB. Add
> "--mca btl ^openib" to your mpirun command. If this allows your application
> to run to completion, then we know exactly where to start looking.
>
> George.
>
> On Jun 27, 2013, at 19:59 , "Blosch, Edwin L" <edwin.l.blo...@lmco.com> wrote:
>
> The debug version also hung, with roughly the same amount of progress in the
> computations (although of course it took much longer to make that progress in
> comparison to the optimized version).
>
> On the bright side, the idea of putting an mpi_barrier after the irecvs and
> before the isends appears to have helped. I was able to run 5 times farther
> without any trouble. So now I'm trying to run 50 times farther and, if there
> is no hang, I will declare workaround victory.
>
> What could this mean?
>
> I am guessing that one or more processes may run ahead of the others, just
> because of the different amounts of work that precede the communication step.
> If a process manages to post all its irecvs and all its isends well before
> another process has managed to post any matching irecvs, perhaps there is
> some buffering resource on the sender side that is getting exhausted? This is
> pure guessing on my part.
>
> Thanks
>
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Ed Blosch
> Sent: Thursday, June 27, 2013 8:01 AM
> To: us...@open-mpi.org
> Subject: EXTERNAL: Re: [OMPI users] Application hangs on mpi_waitall
>
> It ran a bit longer but still deadlocked. All matching sends are posted 1:1
> with posted recvs, so it is a delivery issue of some kind. I'm running a
> debug-compiled version tonight to see what that might turn up. I may try to
> rewrite with blocking sends and see if that works. I can also try adding a
> barrier (irecvs, barrier, isends, waitall) to make sure sends are not
> buffering while waiting for recvs to be posted.
>
> -------- Original message --------
> From: George Bosilca <bosi...@icl.utk.edu>
> Date:
> To: Open MPI Users <us...@open-mpi.org>
> Subject: Re: [OMPI users] Application hangs on mpi_waitall
>
> Ed,
>
> I'm not sure, but there might be a case where the BTL is getting overwhelmed
> by the non-blocking operations while trying to set up the connections. There
> is a simple test for this. Add an MPI_Alltoall with a reasonable size (100k)
> before you start posting the non-blocking receives, and let's see if this
> solves your issue.
>
> George.
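A sketch of the warm-up test George suggests above: issue one MPI_Alltoall before the application posts its first round of non-blocking receives, so that every pair of ranks has already established a connection. The helper name and the 100 KB-per-peer buffer size below are assumptions (one reading of the "100k" in the message), not values confirmed anywhere in the thread.

    /* Hypothetical connection warm-up, per the MPI_Alltoall suggestion above.
     * The 100 KB-per-peer size is an assumed interpretation of "100k". */
    #include <mpi.h>
    #include <stdlib.h>

    static void warmup_alltoall(MPI_Comm comm)
    {
        int size;
        MPI_Comm_size(comm, &size);

        const int per_peer = 100 * 1024;            /* bytes exchanged with each rank */
        char *sbuf = calloc((size_t)size, per_peer);
        char *rbuf = calloc((size_t)size, per_peer);

        /* Forces every pair of ranks to exchange data, so all openib/TCP
         * connections exist before the irecv/isend/waitall phase begins. */
        MPI_Alltoall(sbuf, per_peer, MPI_BYTE, rbuf, per_peer, MPI_BYTE, comm);

        free(sbuf);
        free(rbuf);
    }

Calling something like warmup_alltoall(MPI_COMM_WORLD) once, right after MPI_Init, would be enough for the test; the data exchanged is irrelevant, only the connection setup matters.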
> On Jun 26, 2013, at 04:02 , eblo...@1scom.net wrote:
>
> > An update: I recoded the mpi_waitall as a loop over the requests with
> > mpi_test and a 30-second timeout. The timeout happens unpredictably,
> > sometimes after 10 minutes of run time, other times after 15 minutes, for
> > the exact same case.
> >
> > After 30 seconds, I print out the status of all outstanding receive
> > requests. The message tags that are outstanding have definitely been sent,
> > so I am wondering why they are not getting received.
> >
> > As I said before, everybody posts non-blocking standard receives, then
> > non-blocking standard sends, then calls mpi_waitall. Each process is
> > typically waiting on 200 to 300 requests. Is deadlock possible via this
> > implementation approach under some kind of unusual conditions?
> >
> > Thanks again,
> >
> > Ed
> >
> >> I'm running OpenMPI 1.6.4 and seeing a problem where mpi_waitall never
> >> returns. The case runs fine with MVAPICH. The logic associated with the
> >> communications has been extensively debugged in the past; we don't think
> >> it has errors. Each process posts non-blocking receives, non-blocking
> >> sends, and then does a waitall on all the outstanding requests.
> >>
> >> The work is broken down into 960 chunks. If I run with 960 processes (60
> >> nodes of 16 cores each), things seem to work. If I use 160 processes
> >> (each process handling 6 chunks of work), then each process is handling
> >> 6 times as much communication, and that is the case that hangs with
> >> OpenMPI 1.6.4; again, it seems to work with MVAPICH. Is there an obvious
> >> place to start, diagnostically? We're using the openib btl.
> >>
> >> Thanks,
> >>
> >> Ed
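For reference, the diagnostic Ed describes at the top of the June 26 message above (replacing mpi_waitall with a loop over mpi_test plus a 30-second timeout, then reporting whatever is still outstanding) might look roughly like the following C sketch. The function name, return convention, and reporting are illustrative only, not taken from the actual application.

    /* Hypothetical sketch of the test-with-timeout loop described above.
     * Returns 0 if all requests completed, -1 on timeout. */
    #include <mpi.h>
    #include <stdio.h>

    static int waitall_with_timeout(int n, MPI_Request *req, double timeout_sec)
    {
        double start = MPI_Wtime();
        int remaining = n;

        while (remaining > 0) {
            remaining = 0;
            for (int i = 0; i < n; i++) {
                if (req[i] == MPI_REQUEST_NULL)
                    continue;                 /* already completed earlier */
                int done = 0;
                MPI_Test(&req[i], &done, MPI_STATUS_IGNORE);
                if (!done)
                    remaining++;
            }
            if (remaining > 0 && MPI_Wtime() - start > timeout_sec) {
                for (int i = 0; i < n; i++)
                    if (req[i] != MPI_REQUEST_NULL)
                        fprintf(stderr, "request %d still outstanding after %.0f s\n",
                                i, timeout_sec);
                return -1;
            }
        }
        return 0;
    }

Called in place of mpi_waitall, e.g. waitall_with_timeout(nreq, requests, 30.0), the caller can then print the tag and source of each outstanding receive, as Ed does, before deciding whether to abort or keep waiting.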