Re: [OMPI users] EXTERNAL: Re: Application hangs on mpi_waitall

Blosch, Edwin L Thu, 27 Jun 2013 17:03:38 -0400

I tried excluding openib, but the run still hung.  It made about the same 
progress as it had with the openib interface before hanging (that is, before 
my 30-second timeout period expired).

I'm more than happy to try out any other suggestions...

From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of George Bosilca
Sent: Thursday, June 27, 2013 2:57 PM
To: Open MPI Users
Subject: Re: [OMPI users] EXTERNAL: Re: Application hangs on mpi_waitall

This seems to highlight a possible bug in the MPI implementation. As I 
suggested earlier, the credit management of the OpenIB might be unsafe.

To confirm this, there is one last test to run. Let's prevent the OpenIB 
support from being used during the run (so Open MPI will fall back to TCP). I 
suppose you have Ethernet cards in your cluster, or you have IPoIB. Add "--mca 
btl ^openib" to your mpirun command. If this allows your application to run to 
completion, then we know exactly where to start looking.

  George.

On Jun 27, 2013, at 19:59 , "Blosch, Edwin L" <edwin.l.blo...@lmco.com> wrote:

The debug version also hung after roughly the same amount of progress in the 
computations (although of course it took much longer to make that progress 
than with the optimized version).

On the bright side, the idea of putting an mpi_barrier after the irecvs and 
before the isends appears to have helped.  I was able to run 5 times farther 
without any trouble.  So now I'm trying to run 50 times farther and, if no 
hang, I will declare workaround-victory.

What could this mean?

I am guessing that one or more processes may run ahead of the others, just 
because of the different amounts of work that precede the communication step.  
If a process manages to post all its irecvs and post all its isends well before 
another process has managed to post any matching irecvs, perhaps there is some 
buffering resource on the sender side that is getting exhausted?   This is pure 
guessing on my part.

Thanks

From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Ed Blosch
Sent: Thursday, June 27, 2013 8:01 AM
To: us...@open-mpi.org
Subject: EXTERNAL: Re: [OMPI users] Application hangs on mpi_waitall

It ran a bit longer but still deadlocked.  All matching sends are posted 
1:1 with posted recvs, so it is a delivery issue of some kind.  I'm running a 
debug compiled version tonight to see what that might turn up.  I may try to 
rewrite with blocking sends and see if that works.  I can also try adding a 
barrier (irecvs, barrier, isends, waitall) to make sure sends are not buffering 
waiting for recvs to be posted.

-------- Original message --------
From: George Bosilca <bosi...@icl.utk.edu>
List-Post: users@lists.open-mpi.org
To: Open MPI Users <us...@open-mpi.org>
Subject: Re: [OMPI users] Application hangs on mpi_waitall

Ed,

I'm not sure, but there might be a case where the BTL is getting overwhelmed 
by the non-blocking operations while trying to set up the connections. There 
is a simple test for this: add an MPI_Alltoall with a reasonable size (100k) 
before you start posting the non-blocking receives, and let's see if this 
solves your issue.

  George.

On Jun 26, 2013, at 04:02 , eblo...@1scom.net wrote:

> An update: I recoded the mpi_waitall as a loop over the requests with
> mpi_test and a 30 second timeout.  The timeout happens unpredictably,
> sometimes after 10 minutes of run time, other times after 15 minutes, for
> the exact same case.
>
> After 30 seconds, I print out the status of all outstanding receive
> requests.  The message tags that are outstanding have definitely been
> sent, so I am wondering why they are not getting received?
>
> As I said before, everybody posts non-blocking standard receives, then
> non-blocking standard sends, then calls mpi_waitall. Each process is
> typically waiting on 200 to 300 requests. Is deadlock possible via this
> implementation approach under some kind of unusual conditions?
>
> Thanks again,
>
> Ed
>
>> I'm running OpenMPI 1.6.4 and seeing a problem where mpi_waitall never
>> returns.  The case runs fine with MVAPICH.  The logic associated with the
>> communications has been extensively debugged in the past; we don't think
>> it has errors.   Each process posts non-blocking receives, non-blocking
>> sends, and then does waitall on all the outstanding requests.
>>
>> The work is broken down into 960 chunks. If I run with 960 processes (60
>> nodes of 16 cores each), things seem to work.  If I use 160 processes
>> (each process handling 6 chunks of work), then each process is handling 6
>> times as much communication, and that is the case that hangs with OpenMPI
>> 1.6.4; again, seems to work with MVAPICH.  Is there an obvious place to
>> start, diagnostically?  We're using the openib btl.
>>
>> Thanks,
>>
>> Ed
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
