Re: [OMPI users] Application hangs on mpi_waitall

2013-06-27 Thread Blosch, Edwin L
Also, just to be clear, the attached listing is sorted by the data in the first 
column and doesn’t reflect the call sequence.  In the actual implementation, all 
the messages labeled “mpi-recv” are mpi_irecv calls, and they are all posted 
before any of the mpi_isends.
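
In outline, the exchange looks like the following minimal C sketch (the buffer 
names, neighbor list, and counts are illustrative, not taken from our code):

    #include <mpi.h>

    /* Post every receive before any send, then wait on all requests at once. */
    void exchange(int nneigh, const int *neigh, const int *ncells, int nwords,
                  double **sendbuf, double **recvbuf, MPI_Comm comm)
    {
        MPI_Request req[2 * nneigh];              /* roughly 160 requests here */
        int n = 0;

        for (int i = 0; i < nneigh; i++)          /* all mpi_irecv first ...   */
            MPI_Irecv(recvbuf[i], ncells[i] * nwords, MPI_DOUBLE,
                      neigh[i], 0, comm, &req[n++]);

        /* An MPI_Barrier(comm) could be inserted here for the
           "irecvs, barrier, isends, waitall" test mentioned elsewhere in this
           thread, so no send starts before every rank has its recvs posted. */

        for (int i = 0; i < nneigh; i++)          /* ... then all mpi_isend    */
            MPI_Isend(sendbuf[i], ncells[i] * nwords, MPI_DOUBLE,
                      neigh[i], 0, comm, &req[n++]);

        MPI_Waitall(n, req, MPI_STATUSES_IGNORE);
    }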

From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Blosch, Edwin L
Sent: Thursday, June 27, 2013 12:48 PM
To: Open MPI Users
Subject: EXTERNAL: Re: [OMPI users] Application hangs on mpi_waitall

Attached is the message list for rank 0 for the communication step that is 
failing.  There are about 160 isends and irecvs.  The ‘message size’ column is 
actually a number of cells.  On some steps only one 8-byte word per cell is 
communicated; at another step we exchange 7 words per cell, and at another, 21 
words.  You can see the smallest message is 10 cells and the largest is around 
1000 cells.

Thus for the 7-word communication step the smallest messages are 560 bytes, the 
largest are 56000 bytes, and there is a distribution of sizes in between.  For 
the single-word step the range is 80 to 8000 bytes, and for the 21-word step it 
is 1680 to 168000 bytes.

From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Rolf vandeVaart
Sent: Thursday, June 27, 2013 9:02 AM
To: Open MPI Users
Subject: EXTERNAL: Re: [OMPI users] Application hangs on mpi_waitall

Ed, how large are the messages that you are sending and receiving?
Rolf

From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Ed Blosch
Sent: Thursday, June 27, 2013 9:01 AM
To: us...@open-mpi.org
Subject: Re: [OMPI users] Application hangs on mpi_waitall

It ran a bit longer but still deadlocked.  All matching sends are posted 1:1 
with posted recvs, so it is a delivery issue of some kind.  I'm running a 
debug-compiled version tonight to see what that might turn up.  I may try to 
rewrite with blocking sends and see if that works.  I can also try adding a 
barrier (irecvs, barrier, isends, waitall) to make sure sends are not buffering 
while waiting for recvs to be posted.


Sent via the Samsung Galaxy S™ III, an AT&T 4G LTE smartphone



 Original message 
From: George Bosilca <bosi...@icl.utk.edu>
Date:
To: Open MPI Users <us...@open-mpi.org>
Subject: Re: [OMPI users] Application hangs on mpi_waitall


Ed,

I'm not sure, but there might be a case where the BTL is getting overwhelmed by 
the non-blocking operations while trying to set up the connection. There is a 
simple test for this. Add an MPI_Alltoall with a reasonable size (100k) before 
you start posting the non-blocking receives, and let's see if this solves your 
issue.

  George.


On Jun 26, 2013, at 04:02, eblo...@1scom.net wrote:

> An update: I recoded the mpi_waitall as a loop over the requests with
> mpi_test and a 30 second timeout.  The timeout happens unpredictably,
> sometimes after 10 minutes of run time, other times after 15 minutes, for
> the exact same case.
>
> After 30 seconds, I print out the status of all outstanding receive
> requests.  The message tags that are outstanding have definitely been
> sent, so I am wondering why they are not getting received?
>
> As I said before, everybody posts non-blocking standard receives, then
> non-blocking standard sends, then calls mpi_waitall. Each process is
> typically waiting on 200 to 300 requests. Is deadlock possible via this
> implementation approach under some kind of unusual conditions?
>
> Thanks again,
>
> Ed
>
>> I'm running OpenMPI 1.6.4 and seeing a problem where mpi_waitall never
>> returns.  The case runs fine with MVAPICH.  The logic associated with the
>> communications has been extensively debugged in the past; we don't think
>> it has errors.   Each process posts non-blocking receives, non-blocking
>> sends, and then does waitall on all the outstanding requests.
>>
>> The work is broken down into 960 chunks. If I run with 960 processes (60
>> nodes of 16 cores each), things seem to work.  If I use 160 processes
>> (each process handling 6 chunks of work), then each process is handling 6
>> times as much communication, and that is the case that hangs with OpenMPI
>> 1.6.4; again, seems to work with MVAPICH.  Is there an obvious place to
>> start, diagnostically?  We're using the openib btl.
>>
>> Thanks,
>>
>> Ed

Re: [OMPI users] Application hangs on mpi_waitall

2013-06-27 Thread Blosch, Edwin L
Attached is the message list for rank 0 for the communication step that is 
failing.  There are about 160 isends and irecvs.  The ‘message size’ column is 
actually a number of cells.  On some steps only one 8-byte word per cell is 
communicated; at another step we exchange 7 words per cell, and at another, 21 
words.  You can see the smallest message is 10 cells and the largest is around 
1000 cells.

Thus for the 7-word communication step the smallest messages are 560 bytes, the 
largest are 56000 bytes, and there is a distribution of sizes in between.  For 
the single-word step the range is 80 to 8000 bytes, and for the 21-word step it 
is 1680 to 168000 bytes.
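
For reference, the arithmetic behind those numbers is simply cells × words per 
cell × 8 bytes, e.g. in C:

    #include <stddef.h>

    /* Bytes per message for an exchange of nwords 8-byte words per cell. */
    size_t msg_bytes(size_t ncells, size_t nwords) { return ncells * nwords * 8; }

    /* msg_bytes(10, 7)   ==    560   (smallest message, 7-word step) */
    /* msg_bytes(1000, 7) ==  56000   (largest message,  7-word step) */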

From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Rolf vandeVaart
Sent: Thursday, June 27, 2013 9:02 AM
To: Open MPI Users
Subject: EXTERNAL: Re: [OMPI users] Application hangs on mpi_waitall

Ed, how large are the messages that you are sending and receiving?
Rolf

From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Ed Blosch
Sent: Thursday, June 27, 2013 9:01 AM
To: us...@open-mpi.org
Subject: Re: [OMPI users] Application hangs on mpi_waitall

It ran a bit longer but still deadlocked.  All matching sends are posted 1:1 
with posted recvs, so it is a delivery issue of some kind.  I'm running a 
debug-compiled version tonight to see what that might turn up.  I may try to 
rewrite with blocking sends and see if that works.  I can also try adding a 
barrier (irecvs, barrier, isends, waitall) to make sure sends are not buffering 
while waiting for recvs to be posted.


Sent via the Samsung Galaxy S™ III, an AT&T 4G LTE smartphone



 Original message 
From: George Bosilca <bosi...@icl.utk.edu>
Date:
To: Open MPI Users <us...@open-mpi.org>
Subject: Re: [OMPI users] Application hangs on mpi_waitall


Ed,

I'm not sure, but there might be a case where the BTL is getting overwhelmed by 
the non-blocking operations while trying to set up the connection. There is a 
simple test for this. Add an MPI_Alltoall with a reasonable size (100k) before 
you start posting the non-blocking receives, and let's see if this solves your 
issue.

  George.


On Jun 26, 2013, at 04:02, eblo...@1scom.net wrote:

> An update: I recoded the mpi_waitall as a loop over the requests with
> mpi_test and a 30 second timeout.  The timeout happens unpredictably,
> sometimes after 10 minutes of run time, other times after 15 minutes, for
> the exact same case.
>
> After 30 seconds, I print out the status of all outstanding receive
> requests.  The message tags that are outstanding have definitely been
> sent, so I am wondering why they are not getting received?
>
> As I said before, everybody posts non-blocking standard receives, then
> non-blocking standard sends, then calls mpi_waitall. Each process is
> typically waiting on 200 to 300 requests. Is deadlock possible via this
> implementation approach under some kind of unusual conditions?
>
> Thanks again,
>
> Ed
>
>> I'm running OpenMPI 1.6.4 and seeing a problem where mpi_waitall never
>> returns.  The case runs fine with MVAPICH.  The logic associated with the
>> communications has been extensively debugged in the past; we don't think
>> it has errors.   Each process posts non-blocking receives, non-blocking
>> sends, and then does waitall on all the outstanding requests.
>>
>> The work is broken down into 960 chunks. If I run with 960 processes (60
>> nodes of 16 cores each), things seem to work.  If I use 160 processes
>> (each process handling 6 chunks of work), then each process is handling 6
>> times as much communication, and that is the case that hangs with OpenMPI
>> 1.6.4; again, seems to work with MVAPICH.  Is there an obvious place to
>> start, diagnostically?  We're using the openib btl.
>>
>> Thanks,
>>
>> Ed




(Attachment: send_recv.dat)


Re: [OMPI users] Application hangs on mpi_waitall

2013-06-27 Thread George Bosilca
If I understand correctly, the communication pattern is a one-to-all type of 
communication, isn't it (from your server to your clients)? In that case this 
might be a credit management issue, where the master is running out of ack 
buffers and the clients can't acknowledge the retrieval of the data.

Let's try adding "--mca btl_openib_flags 9" to the mpirun command (this 
disables the RMA communication and forces everything to use pure send/recv 
semantics).
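
On the command line that would look something like the following (the process 
count and executable name here are placeholders, not taken from the thread):

    mpirun --mca btl_openib_flags 9 -np 160 ./your_app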

  George.

On Jun 27, 2013, at 15:01 , Ed Blosch <eblo...@1scom.net> wrote:

> It ran a bit longer but still deadlocked.  All matching sends are posted 1:1 
> with posted recvs, so it is a delivery issue of some kind.  I'm running a 
> debug-compiled version tonight to see what that might turn up.  I may try to 
> rewrite with blocking sends and see if that works.  I can also try adding a 
> barrier (irecvs, barrier, isends, waitall) to make sure sends are not 
> buffering while waiting for recvs to be posted.
> 
> 
> Sent via the Samsung Galaxy S™ III, an AT&T 4G LTE smartphone
> 
> 
> 
>  Original message 
> From: George Bosilca <bosi...@icl.utk.edu> 
> Date: 
> To: Open MPI Users <us...@open-mpi.org> 
> Subject: Re: [OMPI users] Application hangs on mpi_waitall 
> 
> 
> Ed,
> 
> I'm not sure, but there might be a case where the BTL is getting overwhelmed by 
> the non-blocking operations while trying to set up the connection. There is a 
> simple test for this. Add an MPI_Alltoall with a reasonable size (100k) 
> before you start posting the non-blocking receives, and let's see if this 
> solves your issue.
> 
>   George.
> 
> 
> On Jun 26, 2013, at 04:02 , eblo...@1scom.net wrote:
> 
> > An update: I recoded the mpi_waitall as a loop over the requests with
> > mpi_test and a 30 second timeout.  The timeout happens unpredictably,
> > sometimes after 10 minutes of run time, other times after 15 minutes, for
> > the exact same case.
> > 
> > After 30 seconds, I print out the status of all outstanding receive
> > requests.  The message tags that are outstanding have definitely been
> > sent, so I am wondering why they are not getting received?
> > 
> > As I said before, everybody posts non-blocking standard receives, then
> > non-blocking standard sends, then calls mpi_waitall. Each process is
> > typically waiting on 200 to 300 requests. Is deadlock possible via this
> > implementation approach under some kind of unusual conditions?
> > 
> > Thanks again,
> > 
> > Ed
> > 
> >> I'm running OpenMPI 1.6.4 and seeing a problem where mpi_waitall never
> >> returns.  The case runs fine with MVAPICH.  The logic associated with the
> >> communications has been extensively debugged in the past; we don't think
> >> it has errors.   Each process posts non-blocking receives, non-blocking
> >> sends, and then does waitall on all the outstanding requests.
> >> 
> >> The work is broken down into 960 chunks. If I run with 960 processes (60
> >> nodes of 16 cores each), things seem to work.  If I use 160 processes
> >> (each process handling 6 chunks of work), then each process is handling 6
> >> times as much communication, and that is the case that hangs with OpenMPI
> >> 1.6.4; again, seems to work with MVAPICH.  Is there an obvious place to
> >> start, diagnostically?  We're using the openib btl.
> >> 
> >> Thanks,
> >> 
> >> Ed



Re: [OMPI users] Application hangs on mpi_waitall

2013-06-27 Thread Rolf vandeVaart
Ed, how large are the messages that you are sending and receiving?
Rolf

From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Ed Blosch
Sent: Thursday, June 27, 2013 9:01 AM
To: us...@open-mpi.org
Subject: Re: [OMPI users] Application hangs on mpi_waitall

It ran a bit longer but still deadlocked.  All matching sends are posted 1:1 
with posted recvs, so it is a delivery issue of some kind.  I'm running a 
debug-compiled version tonight to see what that might turn up.  I may try to 
rewrite with blocking sends and see if that works.  I can also try adding a 
barrier (irecvs, barrier, isends, waitall) to make sure sends are not buffering 
while waiting for recvs to be posted.


Sent via the Samsung Galaxy S™ III, an AT&T 4G LTE smartphone



 Original message 
From: George Bosilca <bosi...@icl.utk.edu>
Date:
To: Open MPI Users <us...@open-mpi.org>
Subject: Re: [OMPI users] Application hangs on mpi_waitall


Ed,

I'm not sure, but there might be a case where the BTL is getting overwhelmed by 
the non-blocking operations while trying to set up the connection. There is a 
simple test for this. Add an MPI_Alltoall with a reasonable size (100k) before 
you start posting the non-blocking receives, and let's see if this solves your 
issue.

  George.


On Jun 26, 2013, at 04:02, eblo...@1scom.net wrote:

> An update: I recoded the mpi_waitall as a loop over the requests with
> mpi_test and a 30 second timeout.  The timeout happens unpredictably,
> sometimes after 10 minutes of run time, other times after 15 minutes, for
> the exact same case.
>
> After 30 seconds, I print out the status of all outstanding receive
> requests.  The message tags that are outstanding have definitely been
> sent, so I am wondering why they are not getting received?
>
> As I said before, everybody posts non-blocking standard receives, then
> non-blocking standard sends, then calls mpi_waitall. Each process is
> typically waiting on 200 to 300 requests. Is deadlock possible via this
> implementation approach under some kind of unusual conditions?
>
> Thanks again,
>
> Ed
>
>> I'm running OpenMPI 1.6.4 and seeing a problem where mpi_waitall never
>> returns.  The case runs fine with MVAPICH.  The logic associated with the
>> communications has been extensively debugged in the past; we don't think
>> it has errors.   Each process posts non-blocking receives, non-blocking
>> sends, and then does waitall on all the outstanding requests.
>>
>> The work is broken down into 960 chunks. If I run with 960 processes (60
>> nodes of 16 cores each), things seem to work.  If I use 160 processes
>> (each process handling 6 chunks of work), then each process is handling 6
>> times as much communication, and that is the case that hangs with OpenMPI
>> 1.6.4; again, seems to work with MVAPICH.  Is there an obvious place to
>> start, diagnostically?  We're using the openib btl.
>>
>> Thanks,
>>
>> Ed



Re: [OMPI users] Application hangs on mpi_waitall

2013-06-27 Thread Ed Blosch
It ran a bit longer but still deadlocked.  All matching sends are posted 1:1 
with posted recvs, so it is a delivery issue of some kind.  I'm running a 
debug-compiled version tonight to see what that might turn up.  I may try to 
rewrite with blocking sends and see if that works.  I can also try adding a 
barrier (irecvs, barrier, isends, waitall) to make sure sends are not buffering 
while waiting for recvs to be posted.


Sent via the Samsung Galaxy S™ III, an AT&T 4G LTE smartphone

 Original message 
From: George Bosilca <bosi...@icl.utk.edu> 
Date:  
To: Open MPI Users <us...@open-mpi.org> 
Subject: Re: [OMPI users] Application hangs on mpi_waitall 
 
Ed,

I'm not sure, but there might be a case where the BTL is getting overwhelmed by 
the non-blocking operations while trying to set up the connection. There is a 
simple test for this. Add an MPI_Alltoall with a reasonable size (100k) before 
you start posting the non-blocking receives, and let's see if this solves your 
issue.

  George.


On Jun 26, 2013, at 04:02 , eblo...@1scom.net wrote:

> An update: I recoded the mpi_waitall as a loop over the requests with
> mpi_test and a 30 second timeout.  The timeout happens unpredictably,
> sometimes after 10 minutes of run time, other times after 15 minutes, for
> the exact same case.
> 
> After 30 seconds, I print out the status of all outstanding receive
> requests.  The message tags that are outstanding have definitely been
> sent, so I am wondering why they are not getting received?
> 
> As I said before, everybody posts non-blocking standard receives, then
> non-blocking standard sends, then calls mpi_waitall. Each process is
> typically waiting on 200 to 300 requests. Is deadlock possible via this
> implementation approach under some kind of unusual conditions?
> 
> Thanks again,
> 
> Ed
> 
>> I'm running OpenMPI 1.6.4 and seeing a problem where mpi_waitall never
>> returns.  The case runs fine with MVAPICH.  The logic associated with the
>> communications has been extensively debugged in the past; we don't think
>> it has errors.   Each process posts non-blocking receives, non-blocking
>> sends, and then does waitall on all the outstanding requests.
>> 
>> The work is broken down into 960 chunks. If I run with 960 processes (60
>> nodes of 16 cores each), things seem to work.  If I use 160 processes
>> (each process handling 6 chunks of work), then each process is handling 6
>> times as much communication, and that is the case that hangs with OpenMPI
>> 1.6.4; again, seems to work with MVAPICH.  Is there an obvious place to
>> start, diagnostically?  We're using the openib btl.
>> 
>> Thanks,
>> 
>> Ed


Re: [OMPI users] Application hangs on mpi_waitall

2013-06-26 Thread George Bosilca
Ed,

I'm not sure, but there might be a case where the BTL is getting overwhelmed by 
the non-blocking operations while trying to set up the connection. There is a 
simple test for this. Add an MPI_Alltoall with a reasonable size (100k) before 
you start posting the non-blocking receives, and let's see if this solves your 
issue.

  George.


On Jun 26, 2013, at 04:02 , eblo...@1scom.net wrote:

> An update: I recoded the mpi_waitall as a loop over the requests with
> mpi_test and a 30 second timeout.  The timeout happens unpredictably,
> sometimes after 10 minutes of run time, other times after 15 minutes, for
> the exact same case.
> 
> After 30 seconds, I print out the status of all outstanding receive
> requests.  The message tags that are outstanding have definitely been
> sent, so I am wondering why they are not getting received?
> 
> As I said before, everybody posts non-blocking standard receives, then
> non-blocking standard sends, then calls mpi_waitall. Each process is
> typically waiting on 200 to 300 requests. Is deadlock possible via this
> implementation approach under some kind of unusual conditions?
> 
> Thanks again,
> 
> Ed
> 
>> I'm running OpenMPI 1.6.4 and seeing a problem where mpi_waitall never
>> returns.  The case runs fine with MVAPICH.  The logic associated with the
>> communications has been extensively debugged in the past; we don't think
>> it has errors.   Each process posts non-blocking receives, non-blocking
>> sends, and then does waitall on all the outstanding requests.
>> 
>> The work is broken down into 960 chunks. If I run with 960 processes (60
>> nodes of 16 cores each), things seem to work.  If I use 160 processes
>> (each process handling 6 chunks of work), then each process is handling 6
>> times as much communication, and that is the case that hangs with OpenMPI
>> 1.6.4; again, seems to work with MVAPICH.  Is there an obvious place to
>> start, diagnostically?  We're using the openib btl.
>> 
>> Thanks,
>> 
>> Ed




Re: [OMPI users] Application hangs on mpi_waitall

2013-06-25 Thread eblosch
An update: I recoded the mpi_waitall as a loop over the requests with
mpi_test and a 30 second timeout.  The timeout happens unpredictably,
sometimes after 10 minutes of run time, other times after 15 minutes, for
the exact same case.

After 30 seconds, I print out the status of all outstanding receive
requests.  The message tags that are outstanding have definitely been
sent, so I am wondering why they are not getting received?
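
In outline, the substitute for mpi_waitall looks something like this (a C 
sketch; what gets printed for the outstanding requests is up to the caller):

    #include <mpi.h>
    #include <stdio.h>

    /* Poll every request with MPI_Test; report and return if anything is
       still outstanding after timeout_sec seconds. */
    int waitall_with_timeout(int n, MPI_Request *req, double timeout_sec)
    {
        double t0 = MPI_Wtime();
        int outstanding = n;

        while (outstanding > 0) {
            outstanding = 0;
            for (int i = 0; i < n; i++) {
                if (req[i] == MPI_REQUEST_NULL)
                    continue;                       /* already completed */
                int done = 0;
                MPI_Test(&req[i], &done, MPI_STATUS_IGNORE);
                if (!done)
                    outstanding++;
            }
            if (outstanding > 0 && MPI_Wtime() - t0 > timeout_sec) {
                fprintf(stderr, "%d requests still outstanding after %.0f s\n",
                        outstanding, timeout_sec);
                return outstanding;                 /* caller dumps tags here */
            }
        }
        return 0;                                   /* everything completed */
    }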

As I said before, everybody posts non-blocking standard receives, then
non-blocking standard sends, then calls mpi_waitall. Each process is
typically waiting on 200 to 300 requests. Is deadlock possible via this
implementation approach under some kind of unusual conditions?

Thanks again,

Ed

> I'm running OpenMPI 1.6.4 and seeing a problem where mpi_waitall never
> returns.  The case runs fine with MVAPICH.  The logic associated with the
> communications has been extensively debugged in the past; we don't think
> it has errors.   Each process posts non-blocking receives, non-blocking
> sends, and then does waitall on all the outstanding requests.
>
> The work is broken down into 960 chunks. If I run with 960 processes (60
> nodes of 16 cores each), things seem to work.  If I use 160 processes
> (each process handling 6 chunks of work), then each process is handling 6
> times as much communication, and that is the case that hangs with OpenMPI
> 1.6.4; again, seems to work with MVAPICH.  Is there an obvious place to
> start, diagnostically?  We're using the openib btl.
>
> Thanks,
>
> Ed