Re: [OMPI devel] openib BTL problems with ORTE async changes

2013-09-03 Thread Rolf vandeVaart
After a few off-list emails, Ralph was able to reproduce this problem on odin 
when he forced the use of the oob connection code in the openib BTL.
I have created a ticket to track this issue.  I am not sure yet what we will do 
about it.

https://svn.open-mpi.org/trac/ompi/ticket/3746
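
One possible reading of the fault address in the trace quoted later in this thread: the crash is reported at address 0x28 on a line that reads item->opal_list_next, which would be consistent with item being NULL and the field sitting 0x28 bytes into the structure. The sketch below only illustrates that arithmetic. The struct layout is hypothetical (it is not copied from the Open MPI headers), and the names SketchObject, SketchItem and address_read_through are made up for this example.

```python
import ctypes


class SketchObject(ctypes.Structure):
    # Hypothetical debug-style object header; NOT the real Open MPI layout.
    _fields_ = [
        ("magic_id", ctypes.c_uint64),
        ("cls", ctypes.c_void_p),
        ("refcount", ctypes.c_int32),
        ("init_file", ctypes.c_char_p),
        ("init_line", ctypes.c_int),
    ]


class SketchItem(ctypes.Structure):
    # Hypothetical list item: object header followed by the list links.
    _fields_ = [
        ("super", SketchObject),
        ("next", ctypes.c_void_p),
        ("prev", ctypes.c_void_p),
    ]


def address_read_through(base, field="next"):
    """Address touched when reading `field` of a SketchItem located at `base`."""
    return base + getattr(SketchItem, field).offset


if __name__ == "__main__":
    # A NULL item pointer corresponds to base address 0.
    print(hex(address_read_through(0)))
```

On a typical 64-bit build this hypothetical layout puts next at offset 0x28. The real offset depends on the actual definition and build options, so it is worth checking in gdb (for example, by printing the item pointer in frame 0) before drawing conclusions.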


From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Rolf vandeVaart
Sent: Tuesday, September 03, 2013 4:52 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] openib BTL problems with ORTE async changes

Correction: That line below should be:
gmake run FILE=p2p_c

From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Rolf vandeVaart
Sent: Tuesday, September 03, 2013 4:50 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] openib BTL problems with ORTE async changes

I just retried and I still get errors with the latest trunk (r29112).  If I 
back up to r29057, then everything is fine.  In addition, I can reproduce this 
on two different clusters.
Can you try running the entire intel test suite and see if that works?  Maybe a 
different test will fail for you.

   cd ompi-tests/trunk/intel_tests/src
gmake run FILE=cuda_c

You need to modify Makefile in intel_tests to make it do the right thing.  
Trying to figure out what I should do next.  As I said, I get a variety of 
different failures.  Maybe I should collect them up and see what it means.  
This failure has me dead in the water with the trunk.



From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Tuesday, September 03, 2013 3:41 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] openib BTL problems with ORTE async changes

Sigh - I cannot get it to fail. I've tried up to np=16 without getting a single 
hiccup.

Try a fresh checkout - let's make sure you don't have some old cruft lying 
around.

On Sep 3, 2013, at 12:26 PM, Rolf vandeVaart <rvandeva...@nvidia.com> wrote:

I am running a debug build.  Here is my configure line:

../configure --enable-debug --enable-shared --disable-static 
--prefix=/home/rolf/ompi-trunk-29061/64 
--with-wrapper-ldflags='-Wl,-rpath,${prefix}/lib' --disable-vt 
--enable-orterun-prefix-by-default -disable-io-romio  --enable-picky

The test program is from the intel test suite in our test suite.
http://svn.open-mpi.org/svn/ompi-tests/trunk/intel_tests/src/MPI_Irecv_comm_c.c

Run with at least np=4.  The more np, the better.


From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Tuesday, September 03, 2013 3:22 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] openib BTL problems with ORTE async changes

Also, send me your test code - maybe that is required to trigger it

On Sep 3, 2013, at 12:19 PM, Ralph Castain <r...@open-mpi.org> wrote:

Dang - I just finished running it on odin without a problem. Are you seeing 
this with a debug or optimized build?


On Sep 3, 2013, at 12:16 PM, Rolf vandeVaart <rvandeva...@nvidia.com> wrote:

Yes, it fails on the current trunk (r29112).  That is what started me on the 
journey to figure out when things went wrong.  It was working up until r29058.

From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Tuesday, September 03, 2013 2:49 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] openib BTL problems with ORTE async changes

Are you all the way up to the current trunk? There have been a few typo fixes 
since the original commit.

I'm not familiar with the OOB connect code in openib. The OOB itself isn't 
using free list, so I suspect it is something up in the OOB connect code 
itself. I'll take a look and see if something leaps out at me - it seems to be 
working fine on IU's odin cluster, which is the only IB-based system I can 
access


On Sep 3, 2013, at 11:34 AM, Rolf vandeVaart <rvandeva...@nvidia.com> wrote:


As mentioned in the weekly conference call, I am seeing some strange errors 
when using the openib BTL.  I have narrowed down the changeset that broke 
things to the ORTE async code.

https://svn.open-mpi.org/trac/ompi/changeset/29058  (and 
https://svn.open-mpi.org/trac/ompi/changeset/29061 which was needed to fix 
compile errors)

Changeset 29057 does not have these issues.  I do not have a very good 
characterization of the failures.  The failures are not consistent.  Sometimes 
they can pass.  Sometimes the stack trace can be different.  They seem to 
happen more with larger np, like np=4 and more.

The first failure mode is a segmentation violation and it always seems to be 
that we are trying to pop something off a free list.  But the upper parts of the 
stack trace can vary.  This is with the trunk version 29061.
Ralph, any thoughts on where we go 

Re: [OMPI devel] openib BTL problems with ORTE async changes

2013-09-03 Thread Rolf vandeVaart
Correction: That line below should be:
gmake run FILE=p2p_c

From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Rolf vandeVaart
Sent: Tuesday, September 03, 2013 4:50 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] openib BTL problems with ORTE async changes

I just retried and I still get errors with the latest trunk (r29112).  If I 
back up to r29057, then everything is fine.  In addition, I can reproduce this 
on two different clusters.
Can you try running the entire intel test suite and see if that works?  Maybe a 
different test will fail for you.

   cd ompi-tests/trunk/intel_tests/src
gmake run FILE=cuda_c

You need to modify Makefile in intel_tests to make it do the right thing.  
Trying to figure out what I should do next.  As I said, I get a variety of 
different failures.  Maybe I should collect them up and see what it means.  
This failure has me dead in the water with the trunk.



From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Tuesday, September 03, 2013 3:41 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] openib BTL problems with ORTE async changes

Sigh - I cannot get it to fail. I've tried up to np=16 without getting a single 
hiccup.

Try a fresh checkout - let's make sure you don't have some old cruft lying 
around.

On Sep 3, 2013, at 12:26 PM, Rolf vandeVaart <rvandeva...@nvidia.com> wrote:

I am running a debug build.  Here is my configure line:

../configure --enable-debug --enable-shared --disable-static 
--prefix=/home/rolf/ompi-trunk-29061/64 
--with-wrapper-ldflags='-Wl,-rpath,${prefix}/lib' --disable-vt 
--enable-orterun-prefix-by-default -disable-io-romio  --enable-picky

The test program is from the intel test suite in our test suite.
http://svn.open-mpi.org/svn/ompi-tests/trunk/intel_tests/src/MPI_Irecv_comm_c.c

Run with at least np=4.  The more np, the better.


From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Tuesday, September 03, 2013 3:22 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] openib BTL problems with ORTE async changes

Also, send me your test code - maybe that is required to trigger it

On Sep 3, 2013, at 12:19 PM, Ralph Castain <r...@open-mpi.org> wrote:


Dang - I just finished running it on odin without a problem. Are you seeing 
this with a debug or optimized build?


On Sep 3, 2013, at 12:16 PM, Rolf vandeVaart <rvandeva...@nvidia.com> wrote:


Yes, it fails on the current trunk (r29112).  That is what started me on the 
journey to figure out when things went wrong.  It was working up until r29058.

From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Tuesday, September 03, 2013 2:49 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] openib BTL problems with ORTE async changes

Are you all the way up to the current trunk? There have been a few typo fixes 
since the original commit.

I'm not familiar with the OOB connect code in openib. The OOB itself isn't 
using free list, so I suspect it is something up in the OOB connect code 
itself. I'll take a look and see if something leaps out at me - it seems to be 
working fine on IU's odin cluster, which is the only IB-based system I can 
access


On Sep 3, 2013, at 11:34 AM, Rolf vandeVaart <rvandeva...@nvidia.com> wrote:



As mentioned in the weekly conference call, I am seeing some strange errors 
when using the openib BTL.  I have narrowed down the changeset that broke 
things to the ORTE async code.

https://svn.open-mpi.org/trac/ompi/changeset/29058  (and 
https://svn.open-mpi.org/trac/ompi/changeset/29061 which was needed to fix 
compile errors)

Changeset 29057 does not have these issues.  I do not have a very good 
characterization of the failures.  The failures are not consistent.  Sometimes 
they can pass.  Sometimes the stack trace can be different.  They seem to 
happen more with larger np, like np=4 and more.

The first failure mode is a segmentation violation and it always seems to be 
that we are trying to pop something off a free list.  But the upper parts of the 
stack trace can vary.  This is with the trunk version 29061.
Ralph, any thoughts on where we go from here?

[rolf@Fermi-Cluster src]$ mpirun -np 4 -host c0-0,c0-1,c0-3,c0-4 MPI_Irecv_comm_c
MPITEST info  (0): Starting:  MPI_Irecv_comm:
[compute-0-4:04752] *** Process received signal ***
[compute-0-4:04752] Signal: Segmentation fault (11)
[compute-0-4:04752] Signal code: Address not mapped (1)
[compute-0-4:04752] Failing at address: 0x28
--
mpirun noticed that process rank 3 w

Re: [OMPI devel] openib BTL problems with ORTE async changes

2013-09-03 Thread Rolf vandeVaart
I just retried and I still get errors with the latest trunk (r29112).  If I 
back up to r29057, then everything is fine.  In addition, I can reproduce this 
on two different clusters.
Can you try running the entire intel test suite and see if that works?  Maybe a 
different test will fail for you.

   cd ompi-tests/trunk/intel_tests/src
gmake run FILE=cuda_c

You need to modify Makefile in intel_tests to make it do the right thing.  
Trying to figure out what I should do next.  As I said, I get a variety of 
different failures.  Maybe I should collect them up and see what it means.  
This failure has me dead in the water with the trunk.



From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Tuesday, September 03, 2013 3:41 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] openib BTL problems with ORTE async changes

Sigh - I cannot get it to fail. I've tried up to np=16 without getting a single 
hiccup.

Try a fresh checkout - let's make sure you don't have some old cruft lying 
around.

On Sep 3, 2013, at 12:26 PM, Rolf vandeVaart <rvandeva...@nvidia.com> wrote:


I am running a debug build.  Here is my configure line:

../configure --enable-debug --enable-shared --disable-static 
--prefix=/home/rolf/ompi-trunk-29061/64 
--with-wrapper-ldflags='-Wl,-rpath,${prefix}/lib' --disable-vt 
--enable-orterun-prefix-by-default -disable-io-romio  --enable-picky

The test program is from the intel test suite in our test suite.
http://svn.open-mpi.org/svn/ompi-tests/trunk/intel_tests/src/MPI_Irecv_comm_c.c

Run with at least np=4.  The more np, the better.


From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Tuesday, September 03, 2013 3:22 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] openib BTL problems with ORTE async changes

Also, send me your test code - maybe that is required to trigger it

On Sep 3, 2013, at 12:19 PM, Ralph Castain <r...@open-mpi.org> wrote:



Dang - I just finished running it on odin without a problem. Are you seeing 
this with a debug or optimized build?


On Sep 3, 2013, at 12:16 PM, Rolf vandeVaart <rvandeva...@nvidia.com> wrote:



Yes, it fails on the current trunk (r29112).  That is what started me on the 
journey to figure out when things went wrong.  It was working up until r29058.

From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Tuesday, September 03, 2013 2:49 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] openib BTL problems with ORTE async changes

Are you all the way up to the current trunk? There have been a few typo fixes 
since the original commit.

I'm not familiar with the OOB connect code in openib. The OOB itself isn't 
using free list, so I suspect it is something up in the OOB connect code 
itself. I'll take a look and see if something leaps out at me - it seems to be 
working fine on IU's odin cluster, which is the only IB-based system I can 
access


On Sep 3, 2013, at 11:34 AM, Rolf vandeVaart <rvandeva...@nvidia.com> wrote:




As mentioned in the weekly conference call, I am seeing some strange errors 
when using the openib BTL.  I have narrowed down the changeset that broke 
things to the ORTE async code.

https://svn.open-mpi.org/trac/ompi/changeset/29058  (and 
https://svn.open-mpi.org/trac/ompi/changeset/29061 which was needed to fix 
compile errors)

Changeset 29057 does not have these issues.  I do not have a very good 
characterization of the failures.  The failures are not consistent.  Sometimes 
they can pass.  Sometimes the stack trace can be different.  They seem to 
happen more with larger np, like np=4 and more.

The first failure mode is a segmentation violation and it always seems to be 
that we are trying to pop something off a free list.  But the upper parts of the 
stack trace can vary.  This is with the trunk version 29061.
Ralph, any thoughts on where we go from here?

[rolf@Fermi-Cluster src]$ mpirun -np 4 -host c0-0,c0-1,c0-3,c0-4 MPI_Irecv_comm_c
MPITEST info  (0): Starting:  MPI_Irecv_comm:
[compute-0-4:04752] *** Process received signal ***
[compute-0-4:04752] Signal: Segmentation fault (11)
[compute-0-4:04752] Signal code: Address not mapped (1)
[compute-0-4:04752] Failing at address: 0x28
--
mpirun noticed that process rank 3 with PID 4752 on node c0-4 exited on signal 11 (Segmentation fault).
--
[rolf@Fermi-Cluster src]$ gdb MPI_Irecv_comm_c core.4752
GNU gdb Fedora (6.8-27.el5)
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL versio

Re: [OMPI devel] openib BTL problems with ORTE async changes

2013-09-03 Thread Ralph Castain
Sigh - I cannot get it to fail. I've tried up to np=16 without getting a single 
hiccup.

Try a fresh checkout - let's make sure you don't have some old cruft lying 
around.

On Sep 3, 2013, at 12:26 PM, Rolf vandeVaart <rvandeva...@nvidia.com> wrote:

> I am running a debug build.  Here is my configure line:
>  
> ../configure --enable-debug --enable-shared --disable-static 
> --prefix=/home/rolf/ompi-trunk-29061/64 
> --with-wrapper-ldflags='-Wl,-rpath,${prefix}/lib' --disable-vt 
> --enable-orterun-prefix-by-default -disable-io-romio  --enable-picky
>  
> The test program is from the intel test suite in our test suite.
> http://svn.open-mpi.org/svn/ompi-tests/trunk/intel_tests/src/MPI_Irecv_comm_c.c
>  
> Run with at least np=4.  The more np, the better.
>  
>  
> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain
> Sent: Tuesday, September 03, 2013 3:22 PM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] openib BTL problems with ORTE async changes
>  
> Also, send me your test code - maybe that is required to trigger it
>  
> On Sep 3, 2013, at 12:19 PM, Ralph Castain <r...@open-mpi.org> wrote:
> 
> 
> Dang - I just finished running it on odin without a problem. Are you seeing 
> this with a debug or optimized build?
>  
>  
> On Sep 3, 2013, at 12:16 PM, Rolf vandeVaart <rvandeva...@nvidia.com> wrote:
> 
> 
> Yes, it fails on the current trunk (r29112).  That is what started me on the 
> journey to figure out when things went wrong.  It was working up until r29058.
>  
> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain
> Sent: Tuesday, September 03, 2013 2:49 PM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] openib BTL problems with ORTE async changes
>  
> Are you all the way up to the current trunk? There have been a few typo fixes 
> since the original commit.
>  
> I'm not familiar with the OOB connect code in openib. The OOB itself isn't 
> using free list, so I suspect it is something up in the OOB connect code 
> itself. I'll take a look and see if something leaps out at me - it seems to 
> be working fine on IU's odin cluster, which is the only IB-based system I can 
> access
>  
>  
> On Sep 3, 2013, at 11:34 AM, Rolf vandeVaart <rvandeva...@nvidia.com> wrote:
> 
> 
> 
> As mentioned in the weekly conference call, I am seeing some strange errors 
> when using the openib BTL.  I have narrowed down the changeset that broke 
> things to the ORTE async code.
>  
> https://svn.open-mpi.org/trac/ompi/changeset/29058  (and 
> https://svn.open-mpi.org/trac/ompi/changeset/29061 which was needed to fix 
> compile errors)
>  
> Changeset 29057 does not have these issues.  I do not have a very good 
> characterization of the failures.  The failures are not consistent.  
> Sometimes they can pass.  Sometimes the stack trace can be different.  They 
> seem to happen more with larger np, like np=4 and more.   
>  
> The first failure mode is a segmentation violation and it always seems to be 
> that we are trying to pop something off a free list.  But the upper parts of 
> the stack trace can vary.  This is with the trunk version 29061.
> Ralph, any thoughts on where we go from here?
>  
> [rolf@Fermi-Cluster src]$ mpirun -np 4 -host c0-0,c0-1,c0-3,c0-4 MPI_Irecv_comm_c
> MPITEST info  (0): Starting:  MPI_Irecv_comm:
> [compute-0-4:04752] *** Process received signal ***
> [compute-0-4:04752] Signal: Segmentation fault (11)
> [compute-0-4:04752] Signal code: Address not mapped (1)
> [compute-0-4:04752] Failing at address: 0x28
> --
> mpirun noticed that process rank 3 with PID 4752 on node c0-4 exited on signal 11 (Segmentation fault).
> --
> [rolf@Fermi-Cluster src]$ gdb MPI_Irecv_comm_c core.4752
> GNU gdb Fedora (6.8-27.el5)
> Copyright (C) 2008 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> Core was generated by `MPI_Irecv_comm_c'.
> Program terminated with signal 11, Segmentation fault.
> [New process 4753]
> [New process 4756]
> [New process 4755]
> [New process 4754]
> [New process 4752]
> #0  0x2d6ecad6 in opal_atomic_lifo_pop (lifo=0x5996940) at ../../../../../opal/class/opal_atomic_lifo.h:111
> 111 lifo->opal_lifo_head = (opal_list_item_t*)item->opal_list_next;
> (gdb) where
> #0  0x2d6ecad6 in opal_atomic_lifo_pop (lifo=0x5996940) at ../../../../../opal/class/opal_atomic_lifo.h:111
> #1  0x2d6ec5b4 in __ompi_free_list_wait_mt (fl=0x5996940, 
> it

Re: [OMPI devel] openib BTL problems with ORTE async changes

2013-09-03 Thread Rolf vandeVaart
I am running a debug build.  Here is my configure line:

../configure --enable-debug --enable-shared --disable-static 
--prefix=/home/rolf/ompi-trunk-29061/64 
--with-wrapper-ldflags='-Wl,-rpath,${prefix}/lib' --disable-vt 
--enable-orterun-prefix-by-default -disable-io-romio  --enable-picky

The test program is from the intel test suite in our test suite.
http://svn.open-mpi.org/svn/ompi-tests/trunk/intel_tests/src/MPI_Irecv_comm_c.c

Run with at least np=4.  The more np, the better.


From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Tuesday, September 03, 2013 3:22 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] openib BTL problems with ORTE async changes

Also, send me your test code - maybe that is required to trigger it

On Sep 3, 2013, at 12:19 PM, Ralph Castain <r...@open-mpi.org> wrote:


Dang - I just finished running it on odin without a problem. Are you seeing 
this with a debug or optimized build?


On Sep 3, 2013, at 12:16 PM, Rolf vandeVaart <rvandeva...@nvidia.com> wrote:


Yes, it fails on the current trunk (r29112).  That is what started me on the 
journey to figure out when things went wrong.  It was working up until r29058.

From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Tuesday, September 03, 2013 2:49 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] openib BTL problems with ORTE async changes

Are you all the way up to the current trunk? There have been a few typo fixes 
since the original commit.

I'm not familiar with the OOB connect code in openib. The OOB itself isn't 
using free list, so I suspect it is something up in the OOB connect code 
itself. I'll take a look and see if something leaps out at me - it seems to be 
working fine on IU's odin cluster, which is the only IB-based system I can 
access


On Sep 3, 2013, at 11:34 AM, Rolf vandeVaart <rvandeva...@nvidia.com> wrote:



As mentioned in the weekly conference call, I am seeing some strange errors 
when using the openib BTL.  I have narrowed down the changeset that broke 
things to the ORTE async code.

https://svn.open-mpi.org/trac/ompi/changeset/29058  (and 
https://svn.open-mpi.org/trac/ompi/changeset/29061 which was needed to fix 
compile errors)

Changeset 29057 does not have these issues.  I do not have a very good 
characterization of the failures.  The failures are not consistent.  Sometimes 
they can pass.  Sometimes the stack trace can be different.  They seem to 
happen more with larger np, like np=4 and more.

The first failure mode is a segmentation violation and it always seems to be 
that we are trying to pop something off a free list.  But the upper parts of the 
stack trace can vary.  This is with the trunk version 29061.
Ralph, any thoughts on where we go from here?

[rolf@Fermi-Cluster src]$ mpirun -np 4 -host c0-0,c0-1,c0-3,c0-4 MPI_Irecv_comm_c
MPITEST info  (0): Starting:  MPI_Irecv_comm:
[compute-0-4:04752] *** Process received signal ***
[compute-0-4:04752] Signal: Segmentation fault (11)
[compute-0-4:04752] Signal code: Address not mapped (1)
[compute-0-4:04752] Failing at address: 0x28
--
mpirun noticed that process rank 3 with PID 4752 on node c0-4 exited on signal 11 (Segmentation fault).
--
[rolf@Fermi-Cluster src]$ gdb MPI_Irecv_comm_c core.4752
GNU gdb Fedora (6.8-27.el5)
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
Core was generated by `MPI_Irecv_comm_c'.
Program terminated with signal 11, Segmentation fault.
[New process 4753]
[New process 4756]
[New process 4755]
[New process 4754]
[New process 4752]
#0  0x2d6ecad6 in opal_atomic_lifo_pop (lifo=0x5996940) at ../../../../../opal/class/opal_atomic_lifo.h:111
111 lifo->opal_lifo_head = (opal_list_item_t*)item->opal_list_next;
(gdb) where
#0  0x2d6ecad6 in opal_atomic_lifo_pop (lifo=0x5996940) at ../../../../../opal/class/opal_atomic_lifo.h:111
#1  0x2d6ec5b4 in __ompi_free_list_wait_mt (fl=0x5996940, item=0x40ea8d50) at ../../../../../ompi/class/ompi_free_list.h:228
#2  0x2d6ec3f8 in post_recvs (ep=0x59f3120, qp=0, num_post=256) at ../../../../../ompi/mca/btl/openib/btl_openib_endpoint.h:361
#3  0x2d6ec1ae in mca_btl_openib_endpoint_post_rr_nolock (ep=0x59f3120, qp=0)
    at ../../../../../ompi/mca/btl/openib/btl_openib_endpoint.h:405
#4  0x2d6ebfad in mca_btl_openib_endpoint_post_recvs (endpoint=0x59f3120)
    at ../../../../../ompi/mca/btl/openib/btl_openib_endpoint.c:494
#5  0x2d6fe71c in qp_create_all (endpoi

Re: [OMPI devel] openib BTL problems with ORTE async changes

2013-09-03 Thread Ralph Castain
Also, send me your test code - maybe that is required to trigger it

On Sep 3, 2013, at 12:19 PM, Ralph Castain <r...@open-mpi.org> wrote:

> Dang - I just finished running it on odin without a problem. Are you seeing 
> this with a debug or optimized build?
> 
> 
> On Sep 3, 2013, at 12:16 PM, Rolf vandeVaart <rvandeva...@nvidia.com> wrote:
> 
>> Yes, it fails on the current trunk (r29112).  That is what started me on the 
>> journey to figure out when things went wrong.  It was working up until 
>> r29058.
>>  
>> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain
>> Sent: Tuesday, September 03, 2013 2:49 PM
>> To: Open MPI Developers
>> Subject: Re: [OMPI devel] openib BTL problems with ORTE async changes
>>  
>> Are you all the way up to the current trunk? There have been a few typo 
>> fixes since the original commit.
>>  
>> I'm not familiar with the OOB connect code in openib. The OOB itself isn't 
>> using free list, so I suspect it is something up in the OOB connect code 
>> itself. I'll take a look and see if something leaps out at me - it seems to 
>> be working fine on IU's odin cluster, which is the only IB-based system I 
>> can access
>>  
>>  
>> On Sep 3, 2013, at 11:34 AM, Rolf vandeVaart <rvandeva...@nvidia.com> wrote:
>> 
>> 
>> As mentioned in the weekly conference call, I am seeing some strange errors 
>> when using the openib BTL.  I have narrowed down the changeset that broke 
>> things to the ORTE async code.
>>  
>> https://svn.open-mpi.org/trac/ompi/changeset/29058  (and 
>> https://svn.open-mpi.org/trac/ompi/changeset/29061 which was needed to fix 
>> compile errors)
>>  
>> Changeset 29057 does not have these issues.  I do not have a very good 
>> characterization of the failures.  The failures are not consistent.  
>> Sometimes they can pass.  Sometimes the stack trace can be different.  They 
>> seem to happen more with larger np, like np=4 and more.   
>>  
>> The first failure mode is a segmentation violation and it always seems to be 
>> that we are trying to pop something off a free list.  But the upper parts of 
>> the stack trace can vary.  This is with the trunk version 29061.
>> Ralph, any thoughts on where we go from here?
>>  
>> [rolf@Fermi-Cluster src]$ mpirun -np 4 -host c0-0,c0-1,c0-3,c0-4 MPI_Irecv_comm_c
>> MPITEST info  (0): Starting:  MPI_Irecv_comm:
>> [compute-0-4:04752] *** Process received signal ***
>> [compute-0-4:04752] Signal: Segmentation fault (11)
>> [compute-0-4:04752] Signal code: Address not mapped (1)
>> [compute-0-4:04752] Failing at address: 0x28
>> --
>> mpirun noticed that process rank 3 with PID 4752 on node c0-4 exited on signal 11 (Segmentation fault).
>> --
>> [rolf@Fermi-Cluster src]$ gdb MPI_Irecv_comm_c core.4752
>> GNU gdb Fedora (6.8-27.el5)
>> Copyright (C) 2008 Free Software Foundation, Inc.
>> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
>> Core was generated by `MPI_Irecv_comm_c'.
>> Program terminated with signal 11, Segmentation fault.
>> [New process 4753]
>> [New process 4756]
>> [New process 4755]
>> [New process 4754]
>> [New process 4752]
>> #0  0x2d6ecad6 in opal_atomic_lifo_pop (lifo=0x5996940) at ../../../../../opal/class/opal_atomic_lifo.h:111
>> 111 lifo->opal_lifo_head = (opal_list_item_t*)item->opal_list_next;
>> (gdb) where
>> #0  0x2d6ecad6 in opal_atomic_lifo_pop (lifo=0x5996940) at ../../../../../opal/class/opal_atomic_lifo.h:111
>> #1  0x2d6ec5b4 in __ompi_free_list_wait_mt (fl=0x5996940, item=0x40ea8d50) at ../../../../../ompi/class/ompi_free_list.h:228
>> #2  0x2d6ec3f8 in post_recvs (ep=0x59f3120, qp=0, num_post=256) at ../../../../../ompi/mca/btl/openib/btl_openib_endpoint.h:361
>> #3  0x2d6ec1ae in mca_btl_openib_endpoint_post_rr_nolock (ep=0x59f3120, qp=0)
>>     at ../../../../../ompi/mca/btl/openib/btl_openib_endpoint.h:405
>> #4  0x2d6ebfad in mca_btl_openib_endpoint_post_recvs (endpoint=0x59f3120)
>>     at ../../../../../ompi/mca/btl/openib/btl_openib_endpoint.c:494
>> #5  0x2d6fe71c in qp_create_all (endpoint=0x59f3120) at ../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:432
>> #6  0x2d6fde2b in reply_start_connect (endpoint=0x59f3120, 
>> rem

Re: [OMPI devel] openib BTL problems with ORTE async changes

2013-09-03 Thread Ralph Castain
Dang - I just finished running it on odin without a problem. Are you seeing 
this with a debug or optimized build?


On Sep 3, 2013, at 12:16 PM, Rolf vandeVaart <rvandeva...@nvidia.com> wrote:

> Yes, it fails on the current trunk (r29112).  That is what started me on the 
> journey to figure out when things went wrong.  It was working up until r29058.
>  
> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain
> Sent: Tuesday, September 03, 2013 2:49 PM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] openib BTL problems with ORTE async changes
>  
> Are you all the way up to the current trunk? There have been a few typo fixes 
> since the original commit.
>  
> I'm not familiar with the OOB connect code in openib. The OOB itself isn't 
> using free list, so I suspect it is something up in the OOB connect code 
> itself. I'll take a look and see if something leaps out at me - it seems to 
> be working fine on IU's odin cluster, which is the only IB-based system I can 
> access
>  
>  
> On Sep 3, 2013, at 11:34 AM, Rolf vandeVaart <rvandeva...@nvidia.com> wrote:
> 
> 
> As mentioned in the weekly conference call, I am seeing some strange errors 
> when using the openib BTL.  I have narrowed down the changeset that broke 
> things to the ORTE async code.
>  
> https://svn.open-mpi.org/trac/ompi/changeset/29058  (and 
> https://svn.open-mpi.org/trac/ompi/changeset/29061 which was needed to fix 
> compile errors)
>  
> Changeset 29057 does not have these issues.  I do not have a very good 
> characterization of the failures.  The failures are not consistent.  
> Sometimes they can pass.  Sometimes the stack trace can be different.  They 
> seem to happen more with larger np, like np=4 and more.   
>  
> The first failure mode is a segmentation violation and it always seems to be 
> that we are trying to pop something off a free list.  But the upper parts of 
> the stack trace can vary.  This is with the trunk version 29061.
> Ralph, any thoughts on where we go from here?
>  
> [rolf@Fermi-Cluster src]$ mpirun -np 4 -host c0-0,c0-1,c0-3,c0-4 MPI_Irecv_comm_c
> MPITEST info  (0): Starting:  MPI_Irecv_comm:
> [compute-0-4:04752] *** Process received signal ***
> [compute-0-4:04752] Signal: Segmentation fault (11)
> [compute-0-4:04752] Signal code: Address not mapped (1)
> [compute-0-4:04752] Failing at address: 0x28
> --
> mpirun noticed that process rank 3 with PID 4752 on node c0-4 exited on signal 11 (Segmentation fault).
> --
> [rolf@Fermi-Cluster src]$ gdb MPI_Irecv_comm_c core.4752
> GNU gdb Fedora (6.8-27.el5)
> Copyright (C) 2008 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> Core was generated by `MPI_Irecv_comm_c'.
> Program terminated with signal 11, Segmentation fault.
> [New process 4753]
> [New process 4756]
> [New process 4755]
> [New process 4754]
> [New process 4752]
> #0  0x2d6ecad6 in opal_atomic_lifo_pop (lifo=0x5996940) at ../../../../../opal/class/opal_atomic_lifo.h:111
> 111 lifo->opal_lifo_head = (opal_list_item_t*)item->opal_list_next;
> (gdb) where
> #0  0x2d6ecad6 in opal_atomic_lifo_pop (lifo=0x5996940) at ../../../../../opal/class/opal_atomic_lifo.h:111
> #1  0x2d6ec5b4 in __ompi_free_list_wait_mt (fl=0x5996940, item=0x40ea8d50) at ../../../../../ompi/class/ompi_free_list.h:228
> #2  0x2d6ec3f8 in post_recvs (ep=0x59f3120, qp=0, num_post=256) at ../../../../../ompi/mca/btl/openib/btl_openib_endpoint.h:361
> #3  0x2d6ec1ae in mca_btl_openib_endpoint_post_rr_nolock (ep=0x59f3120, qp=0)
>     at ../../../../../ompi/mca/btl/openib/btl_openib_endpoint.h:405
> #4  0x2d6ebfad in mca_btl_openib_endpoint_post_recvs (endpoint=0x59f3120)
>     at ../../../../../ompi/mca/btl/openib/btl_openib_endpoint.c:494
> #5  0x2d6fe71c in qp_create_all (endpoint=0x59f3120) at ../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:432
> #6  0x2d6fde2b in reply_start_connect (endpoint=0x59f3120, rem_info=0x40ea8ed0)
>     at ../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:245
> #7  0x2d7006ae in rml_recv_cb (status=0, process_name=0x5b0bb90, buffer=0x40ea8f80, tag=102, cbdata=0x0)
>     at ../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:858
> #8  0x2ae802454601 in orte_rml_base_process_msg (fd=-1, flags=4, cbdata=0x5b0bac0)
>     at ../../../../orte/mca/rml/base/rml_base_msg_handlers.c:172
> #9  0x2ae8027164a1

Re: [OMPI devel] openib BTL problems with ORTE async changes

2013-09-03 Thread Rolf vandeVaart
Yes, it fails on the current trunk (r29112).  That is what started me on the 
journey to figure out when things went wrong.  It was working up until r29058.

From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Tuesday, September 03, 2013 2:49 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] openib BTL problems with ORTE async changes

Are you all the way up to the current trunk? There have been a few typo fixes 
since the original commit.

I'm not familiar with the OOB connect code in openib. The OOB itself isn't 
using a free list, so I suspect it is something up in the OOB connect code 
itself. I'll take a look and see if something leaps out at me - it seems to be 
working fine on IU's odin cluster, which is the only IB-based system I can 
access.


On Sep 3, 2013, at 11:34 AM, Rolf vandeVaart 
<rvandeva...@nvidia.com<mailto:rvandeva...@nvidia.com>> wrote:


As mentioned in the weekly conference call, I am seeing some strange errors 
when using the openib BTL.  I have narrowed down the changeset that broke 
things to the ORTE async code.

https://svn.open-mpi.org/trac/ompi/changeset/29058  (and 
https://svn.open-mpi.org/trac/ompi/changeset/29061 which was needed to fix 
compile errors)

Changeset 29057 does not have these issues.  I do not have a very good 
characterization of the failures.  The failures are not consistent.  Sometimes 
they can pass.  Sometimes the stack trace can be different.  They seem to 
happen more with larger np, like np=4 and more.
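
One way to put numbers on an intermittent failure like this is to repeat the same run and count how often it fails. The following is only a minimal sketch: count_failures is a hypothetical helper, not part of Open MPI or the intel test suite, and it assumes that mpirun returns a non-zero exit status when a rank dies on a signal.

```shell
#!/bin/sh
# count_failures N CMD [ARGS...]
# Runs CMD N times and prints how many runs exited with a non-zero status.
# Hypothetical helper for illustration; not part of Open MPI or its tests.
count_failures() {
    runs=$1
    shift
    failures=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        # Discard the program's output; only the exit status matters here.
        if ! "$@" >/dev/null 2>&1; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    echo "$failures"
}

# Example, using the command line from this report (adjust hosts as needed):
#   count_failures 20 mpirun -np 4 -host c0-0,c0-1,c0-3,c0-4 MPI_Irecv_comm_c
```

Running the same count at several np values, and at r29057 versus r29061, would give a rough failure rate per configuration instead of a single pass/fail observation.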

The first failure mode is a segmentation violation and it always seems to be 
that we are trying to pop something off a free list.  But the upper parts of the 
stack trace can vary.  This is with the trunk version 29061.
Ralph, any thoughts on where we go from here?

[rolf@Fermi-Cluster src]$ mpirun -np 4 -host c0-0,c0-1,c0-3,c0-4 MPI_Irecv_comm_c
MPITEST info  (0): Starting:  MPI_Irecv_comm:
[compute-0-4:04752] *** Process received signal ***
[compute-0-4:04752] Signal: Segmentation fault (11)
[compute-0-4:04752] Signal code: Address not mapped (1)
[compute-0-4:04752] Failing at address: 0x28
--
mpirun noticed that process rank 3 with PID 4752 on node c0-4 exited on signal 
11 (Segmentation fault).
--
[rolf@Fermi-Cluster src]$ gdb MPI_Irecv_comm_c core.4752
GNU gdb Fedora (6.8-27.el5)
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
Core was generated by `MPI_Irecv_comm_c'.
Program terminated with signal 11, Segmentation fault.
[New process 4753]
[New process 4756]
[New process 4755]
[New process 4754]
[New process 4752]
#0  0x2d6ecad6 in opal_atomic_lifo_pop (lifo=0x5996940) at 
../../../../../opal/class/opal_atomic_lifo.h:111
111 lifo->opal_lifo_head = (opal_list_item_t*)item->opal_list_next;
(gdb) where
#0  0x2d6ecad6 in opal_atomic_lifo_pop (lifo=0x5996940) at 
../../../../../opal/class/opal_atomic_lifo.h:111
#1  0x2d6ec5b4 in __ompi_free_list_wait_mt (fl=0x5996940, 
item=0x40ea8d50) at ../../../../../ompi/class/ompi_free_list.h:228
#2  0x2d6ec3f8 in post_recvs (ep=0x59f3120, qp=0, num_post=256) at 
../../../../../ompi/mca/btl/openib/btl_openib_endpoint.h:361
#3  0x2d6ec1ae in mca_btl_openib_endpoint_post_rr_nolock (ep=0x59f3120, 
qp=0)
at ../../../../../ompi/mca/btl/openib/btl_openib_endpoint.h:405
#4  0x2d6ebfad in mca_btl_openib_endpoint_post_recvs 
(endpoint=0x59f3120)
at ../../../../../ompi/mca/btl/openib/btl_openib_endpoint.c:494
#5  0x2d6fe71c in qp_create_all (endpoint=0x59f3120) at 
../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:432
#6  0x2d6fde2b in reply_start_connect (endpoint=0x59f3120, 
rem_info=0x40ea8ed0)
at ../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:245
#7  0x2d7006ae in rml_recv_cb (status=0, process_name=0x5b0bb90, 
buffer=0x40ea8f80, tag=102, cbdata=0x0)
at ../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:858
#8  0x2ae802454601 in orte_rml_base_process_msg (fd=-1, flags=4, 
cbdata=0x5b0bac0)
at ../../../../orte/mca/rml/base/rml_base_msg_handlers.c:172
#9  0x2ae8027164a1 in event_process_active_single_queue (base=0x58ac620, 
activeq=0x58aa5b0)
at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1367
#10 0x2ae802716b24 in event_process_active (base=0x58ac620) at 
../../../../../../opal/mca/event/libevent2021/libevent/event.c:1437
#11 0x2ae80271715c in opal_libevent2021_event_base_loop (base=0x58ac620, 
flags=1)
at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1645
#12 0x2ae8023e7465 in orte_progress_thread_engine (obj=0x2ae8026902c0) at 
../../orte/runtime/orte_init.c:180
#13 0x003ab1e06367 in start_thread () from /l

Re: [OMPI devel] openib BTL problems with ORTE async changes

2013-09-03 Thread Ralph Castain
Are you all the way up to the current trunk? There have been a few typo fixes 
since the original commit.

I'm not familiar with the OOB connect code in openib. The OOB itself isn't 
using a free list, so I suspect it is something up in the OOB connect code 
itself. I'll take a look and see if something leaps out at me - it seems to be 
working fine on IU's odin cluster, which is the only IB-based system I can 
access.


On Sep 3, 2013, at 11:34 AM, Rolf vandeVaart wrote:

> As mentioned in the weekly conference call, I am seeing some strange errors 
> when using the openib BTL.  I have narrowed down the changeset that broke 
> things to the ORTE async code.
>  
> https://svn.open-mpi.org/trac/ompi/changeset/29058  (and 
> https://svn.open-mpi.org/trac/ompi/changeset/29061 which was needed to fix 
> compile errors)
>  
> Changeset 29057 does not have these issues.  I do not have a very good 
> characterization of the failures.  The failures are not consistent.  
> Sometimes they can pass.  Sometimes the stack trace can be different.  They 
> seem to happen more with larger np, like np=4 and more.   
>  
> The first failure mode is a segmentation violation and it always seems to be 
> that we are trying to pop something off a free list.  But the upper parts of 
> the stack trace can vary.  This is with the trunk version 29061.
> Ralph, any thoughts on where we go from here?
>  
> [rolf@Fermi-Cluster src]$ mpirun -np 4 -host c0-0,c0-1,c0-3,c0-4 MPI_Irecv_comm_c
> MPITEST info  (0): Starting:  MPI_Irecv_comm:
> [compute-0-4:04752] *** Process received signal ***
> [compute-0-4:04752] Signal: Segmentation fault (11)
> [compute-0-4:04752] Signal code: Address not mapped (1)
> [compute-0-4:04752] Failing at address: 0x28
> --
> mpirun noticed that process rank 3 with PID 4752 on node c0-4 exited on 
> signal 11 (Segmentation fault).
> --
> [rolf@Fermi-Cluster src]$ gdb MPI_Irecv_comm_c core.4752
> GNU gdb Fedora (6.8-27.el5)
> Copyright (C) 2008 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> Core was generated by `MPI_Irecv_comm_c'.
> Program terminated with signal 11, Segmentation fault.
> [New process 4753]
> [New process 4756]
> [New process 4755]
> [New process 4754]
> [New process 4752]
> #0  0x2d6ecad6 in opal_atomic_lifo_pop (lifo=0x5996940) at 
> ../../../../../opal/class/opal_atomic_lifo.h:111
> 111 lifo->opal_lifo_head = 
> (opal_list_item_t*)item->opal_list_next;
> (gdb) where
> #0  0x2d6ecad6 in opal_atomic_lifo_pop (lifo=0x5996940) at 
> ../../../../../opal/class/opal_atomic_lifo.h:111
> #1  0x2d6ec5b4 in __ompi_free_list_wait_mt (fl=0x5996940, 
> item=0x40ea8d50) at ../../../../../ompi/class/ompi_free_list.h:228
> #2  0x2d6ec3f8 in post_recvs (ep=0x59f3120, qp=0, num_post=256) at 
> ../../../../../ompi/mca/btl/openib/btl_openib_endpoint.h:361
> #3  0x2d6ec1ae in mca_btl_openib_endpoint_post_rr_nolock 
> (ep=0x59f3120, qp=0)
> at ../../../../../ompi/mca/btl/openib/btl_openib_endpoint.h:405
> #4  0x2d6ebfad in mca_btl_openib_endpoint_post_recvs 
> (endpoint=0x59f3120)
> at ../../../../../ompi/mca/btl/openib/btl_openib_endpoint.c:494
> #5  0x2d6fe71c in qp_create_all (endpoint=0x59f3120) at 
> ../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:432
> #6  0x2d6fde2b in reply_start_connect (endpoint=0x59f3120, 
> rem_info=0x40ea8ed0)
> at ../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:245
> #7  0x2d7006ae in rml_recv_cb (status=0, process_name=0x5b0bb90, 
> buffer=0x40ea8f80, tag=102, cbdata=0x0)
> at ../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:858
> #8  0x2ae802454601 in orte_rml_base_process_msg (fd=-1, flags=4, 
> cbdata=0x5b0bac0)
> at ../../../../orte/mca/rml/base/rml_base_msg_handlers.c:172
> #9  0x2ae8027164a1 in event_process_active_single_queue (base=0x58ac620, 
> activeq=0x58aa5b0)
> at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1367
> #10 0x2ae802716b24 in event_process_active (base=0x58ac620) at 
> ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1437
> #11 0x2ae80271715c in opal_libevent2021_event_base_loop (base=0x58ac620, 
> flags=1)
> at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1645
> #12 0x2ae8023e7465 in orte_progress_thread_engine (obj=0x2ae8026902c0) at 
> ../../orte/runtime/orte_init.c:180
> #13 0x003ab1e06367 in start_thread () from /lib64/libpthread.so.0
> #14 0x003ab16d2f7d in clone () from /lib64/libc.so.6
> (gdb)
>  
> This email message is for the sole use of the intended recipient(s) and may 
> contain confidential information.  Any unauthorized review, use, disclosure 
> or distribution is