Re: [OMPI devel] Open-MPI build of NAMD launched from srun over 20% slower than with mpirun

2013-09-03 Thread Ralph Castain
Your code is obviously doing much more than just launching and wiring up, so it 
is difficult to assess the difference in speed between 1.6.5 and 1.7.3 - my 
guess is that it comes from changes in the MPI transport layer and has nothing 
to do with PMI.

Likewise, I can't imagine any difference in wireup method accounting for a 
500-second difference in execution time between the two versions when using the 
same launch method. I launch jobs across more than 10 nodes in far less time 
than that, so again I expect this has to do with something in the MPI layer.

The real question is why you see so much difference between launching via 
mpirun vs srun. Like I said, the launch and wireup times at such small scales 
are negligible, so somehow you are winding up selecting different MPI transport 
options. You can test this by just running "hello world" instead - I'll bet the 
mpirun vs srun time differences are a second or two at most.
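
For example, a minimal timing check along these lines would do (a sketch only, 
not NAMD-specific; nothing here beyond standard MPI calls):

/* hello_time.c - crude check of whether startup/wireup, rather than the
 * MPI transport used by the application, differs between launchers.
 * Build with mpicc, run the same binary once under mpirun and once under
 * srun inside the same allocation, and compare the "time" output. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    t0 = MPI_Wtime();
    MPI_Barrier(MPI_COMM_WORLD);   /* triggers at least some connection setup */
    t1 = MPI_Wtime();

    if (rank == 0) {
        printf("hello from %d ranks; first barrier took %f seconds\n",
               size, t1 - t0);
    }

    MPI_Finalize();
    return 0;
}

If that binary's total wall-clock time differs by more than a second or two 
between the two launchers, the difference really is in launch/wireup; if not, 
the extra 500 seconds must be coming from whatever transport the full 
application ends up selecting.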

Perhaps Jeff or someone else can suggest some debug flags you could use to 
understand these differences?



On Sep 3, 2013, at 6:13 PM, Christopher Samuel  wrote:

> 
> On 03/09/13 10:56, Ralph Castain wrote:
> 
>> Yeah - --with-pmi=
> 
> Actually I found that just --with-pmi=/usr/local/slurm/latest worked. :-)
> 
> I've got some initial numbers for 64 cores, as I mentioned the system
> I found this on initially is so busy at the moment I won't be able to
> run anything bigger for a while, so I'm going to move my testing to
> another system which is a bit quieter, but slower (it's Nehalem vs
> SandyBridge).
> 
> All the below tests are with the same NAMD 2.9 binary and within the
> same Slurm job so it runs on the same cores each time. It's nice to
> find that C code at least seems to be backwardly compatible!
> 
> 64 cores over 18 nodes:
> 
> Open-MPI 1.6.5 with mpirun - 7842 seconds
> Open-MPI 1.7.3a1r29103 with srun - 7522 seconds
> 
> so that's about a 4% speedup.
> 
> 64 cores over 10 nodes:
> 
> Open-MPI 1.7.3a1r29103 with mpirun - 8341 seconds
> Open-MPI 1.7.3a1r29103 with srun - 7476 seconds
> 
> So that's about 11% faster, and the mpirun speed has decreased; of course
> that build uses PMI, so perhaps that's the cause?
> 
> cheers,
> Chris
> -- 
> Christopher Samuel        Senior Systems Administrator
> VLSCI - Victorian Life Sciences Computation Initiative
> Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
> http://www.vlsci.org.au/  http://twitter.com/vlsci
> 



Re: [OMPI devel] Possible OMPI 1.6.5 bug? SEGV in malloc.c

2013-09-03 Thread Christopher Samuel

On 04/09/13 04:47, Jeff Squyres (jsquyres) wrote:

> Hmm.  Are you building Open MPI in a special way?  I ask because I'm
> unable to replicate the issue -- I've run your test (and a C
> equivalent) a few hundred times now:

I don't think we do anything unusual; the script we are using is
fairly simple (it does a module purge to ensure we are using just the
system compilers and don't pick up anything strange) and is as follows:

#!/bin/bash

# Derive the install path from the build directory name,
# e.g. "openmpi-1.6.5" -> "openmpi/1.6.5"
BASE=`basename $PWD | sed -e s,-,/,`

# Ensure only the system compilers and libraries are in the environment
module purge

./configure --prefix=/usr/local/${BASE} --with-slurm --with-openib \
    --enable-static --enable-shared

make -j


-- 
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci



Re: [OMPI devel] Open-MPI build of NAMD launched from srun over 20% slower than with mpirun

2013-09-03 Thread Christopher Samuel

On 03/09/13 10:56, Ralph Castain wrote:

> Yeah - --with-pmi=

Actually I found that just --with-pmi=/usr/local/slurm/latest worked. :-)

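For reference, the build and launch pattern being compared is roughly the 
following (install prefix, Slurm path and the NAMD command line are 
placeholders, not our exact ones):

# Open MPI 1.7.x built with Slurm PMI support:
./configure --prefix=/usr/local/openmpi/1.7.3a1 \
            --with-slurm --with-openib \
            --with-pmi=/usr/local/slurm/latest
make -j && make install

# Direct launch under Slurm, no mpirun needed:
srun -n 64 namd2 config.namd

# versus the traditional launch via mpirun inside the same allocation:
mpirun -np 64 namd2 config.namd
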
I've got some initial numbers for 64 cores, as I mentioned the system
I found this on initially is so busy at the moment I won't be able to
run anything bigger for a while, so I'm going to move my testing to
another system which is a bit quieter, but slower (it's Nehalem vs
SandyBridge).

All the below tests are with the same NAMD 2.9 binary and within the
same Slurm job so it runs on the same cores each time. It's nice to
find that C code at least seems to be backwardly compatible!

64 cores over 18 nodes:

Open-MPI 1.6.5 with mpirun - 7842 seconds
Open-MPI 1.7.3a1r29103 with srun - 7522 seconds

so that's about a 4% speedup.

64 cores over 10 nodes:

Open-MPI 1.7.3a1r29103 with mpirun - 8341 seconds
Open-MPI 1.7.3a1r29103 with srun - 7476 seconds

So that's about 11% faster, and the mpirun speed has decreased; of course that 
build uses PMI, so perhaps that's the cause?

cheers,
Chris
-- 
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci



[hwloc-devel] Create success (hwloc r1.7.3rc1r5779)

2013-09-03 Thread MPI Team
Creating nightly hwloc snapshot SVN tarball was a success.

Snapshot:   hwloc 1.7.3rc1r5779
Start time: Tue Sep  3 21:05:21 EDT 2013
End time:   Tue Sep  3 21:09:24 EDT 2013

Your friendly daemon,
Cyrador


[hwloc-devel] Create success (hwloc r1.8a1r5786)

2013-09-03 Thread MPI Team
Creating nightly hwloc snapshot SVN tarball was a success.

Snapshot:   hwloc 1.8a1r5786
Start time: Tue Sep  3 21:01:01 EDT 2013
End time:   Tue Sep  3 21:05:21 EDT 2013

Your friendly daemon,
Cyrador


Re: [OMPI devel] GNU Automake 1.14 released

2013-09-03 Thread Ralph Castain
I still don't see an issue with just detecting the version of automake being 
used, and setting a conditional that indicates whether or not to explicitly 
include the subdir. Seems like a pretty trivial solution.
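
Something like the following, perhaps (purely illustrative: the version check 
and conditional name are invented here, it assumes the automake found on PATH 
at configure time is the one that generated the build system, and what each 
branch of the conditional would actually select depends on the real tree 
layout):

# configure.ac fragment (illustrative only)
am_ver=`${AUTOMAKE:-automake} --version | head -n 1 | sed 's/^.* //'`
AS_VERSION_COMPARE([$am_ver], [1.14],
                   [need_explicit_subdir=no],
                   [need_explicit_subdir=yes],
                   [need_explicit_subdir=yes])
AM_CONDITIONAL([NEED_EXPLICIT_SUBDIR], [test "$need_explicit_subdir" = yes])

# Makefile.am fragment: include the subdir prefix only when the automake
# in use requires it
if NEED_EXPLICIT_SUBDIR
FOO_COMMONSOURCES = $(MY_SRCDIR)/foo.c
else
FOO_COMMONSOURCES = foo.c
endif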


On Sep 3, 2013, at 3:49 PM, "Jeff Squyres (jsquyres)"  
wrote:

> On Sep 3, 2013, at 6:45 PM, Fabrício Zimmerer Murta 
>  wrote:
> 
>> I think autotools has a concept of disallowing symlinks, as it seems symlinks 
>> can't be done in a portable way, and the goal of autotools is making 
>> projects portable.
>> 
>> Well, if an autotools user chooses to use symlinks, they must expect to 
>> break portability wherever they take their autoconfiscated code. That's the 
>> user's choice. Maybe in this case, since the project is bound to specific 
>> compilers, it would not be a problem to lose a bit more portability by 
>> using symbolic links.
> 
> Fair enough.
> 
> We've been using sym links in the OMPI project for years in order to compile 
> a series of .c files in 2 different ways.  It's portable to all the places 
> that we need/want it.
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 



Re: [OMPI devel] GNU Automake 1.14 released

2013-09-03 Thread Jeff Squyres (jsquyres)
On Sep 3, 2013, at 6:45 PM, Fabrício Zimmerer Murta  
wrote:

> I think autotools has a concept of disallowing symlinks, as it seems symlinks 
> can't be done in a portable way, and the goal of autotools is making projects 
> portable.
> 
> Well, if an autotools user chooses to use symlinks, they must expect to break 
> portability wherever they take their autoconfiscated code. That's the user's 
> choice. Maybe in this case, since the project is bound to specific compilers, 
> it would not be a problem to lose a bit more portability by using symbolic 
> links.

Fair enough.

We've been using sym links in the OMPI project for years in order to compile a 
series of .c files in 2 different ways.  It's portable to all the places that 
we need/want it.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] openib BTL problems with ORTE async changes

2013-09-03 Thread Rolf vandeVaart
Through a few off-list emails, Ralph was able to reproduce this problem on odin 
when he forced the use of the oob connection code in the openib BTL.
I have created a ticket to track this issue; not sure yet what we will do 
about it.

https://svn.open-mpi.org/trac/ompi/ticket/3746


From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Rolf vandeVaart
Sent: Tuesday, September 03, 2013 4:52 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] openib BTL problems with ORTE async changes

Correction: That line below should be:
gmake run FILE=p2p_c

From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Rolf vandeVaart
Sent: Tuesday, September 03, 2013 4:50 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] openib BTL problems with ORTE async changes

I just retried and I still get errors with the latest trunk. (29112).  If I 
back up to r29057, then everything is fine.  In addition, I can reproduce this 
on two different clusters.
Can you try running the entire intel test suite and see if that works?  Maybe a 
different test will fail for you.

   cd ompi-tests/trunk/intel_tests/src
gmake run FILE=cuda_c

You need to modify Makefile in intel_tests to make it do the right thing.  
Trying to figure out what I should do next.  As I said, I get a variety of 
different failures.  Maybe I should collect them up and see what it means.  
This failure has me dead in the water with the trunk.



From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Tuesday, September 03, 2013 3:41 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] openib BTL problems with ORTE async changes

Sigh - I cannot get it to fail. I've tried up to np=16 without getting a single 
hiccup.

Try a fresh checkout - let's make sure you don't have some old cruft laying 
around.

On Sep 3, 2013, at 12:26 PM, Rolf vandeVaart wrote:

I am running a debug build.  Here is my configure line:

../configure --enable-debug --enable-shared --disable-static 
--prefix=/home/rolf/ompi-trunk-29061/64 
--with-wrapper-ldflags='-Wl,-rpath,${prefix}/lib' --disable-vt 
--enable-orterun-prefix-by-default -disable-io-romio --enable-picky

The test program is from the intel test suite in our test suite.
http://svn.open-mpi.org/svn/ompi-tests/trunk/intel_tests/src/MPI_Irecv_comm_c.c

Run with at least np=4.  The more np, the better.


From: devel [mailto:devel-boun...@open-mpi.org] On 
Behalf Of Ralph Castain
Sent: Tuesday, September 03, 2013 3:22 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] openib BTL problems with ORTE async changes

Also, send me your test code - maybe that is required to trigger it

On Sep 3, 2013, at 12:19 PM, Ralph Castain wrote:

Dang - I just finished running it on odin without a problem. Are you seeing 
this with a debug or optimized build?


On Sep 3, 2013, at 12:16 PM, Rolf vandeVaart wrote:

Yes, it fails on the current trunk (r29112).  That is what started me on the 
journey to figure out when things went wrong.  It was working up until r29058.

From: devel [mailto:devel-boun...@open-mpi.org] On 
Behalf Of Ralph Castain
Sent: Tuesday, September 03, 2013 2:49 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] openib BTL problems with ORTE async changes

Are you all the way up to the current trunk? There have been a few typo fixes 
since the original commit.

I'm not familiar with the OOB connect code in openib. The OOB itself isn't 
using free list, so I suspect it is something up in the OOB connect code 
itself. I'll take a look and see if something leaps out at me - it seems to be 
working fine on IU's odin cluster, which is the only IB-based system I can 
access


On Sep 3, 2013, at 11:34 AM, Rolf vandeVaart wrote:


As mentioned in the weekly conference call, I am seeing some strange errors 
when using the openib BTL.  I have narrowed down the changeset that broke 
things to the ORTE async code.

https://svn.open-mpi.org/trac/ompi/changeset/29058  (and 
https://svn.open-mpi.org/trac/ompi/changeset/29061 which was needed to fix 
compile errors)

Changeset 29057 does not have these issues.  I do not have a very good 
characterization of the failures.  The failures are not consistent.  Sometimes 
they can pass.  Sometimes the stack trace can be different.  They seem to 
happen more with larger np, like np=4 and more.

The first failure mode is a segmentation violation and it always seems to be 
that we are trying to pop something off a free list.  But the upper parts of the 
stack trace can vary.  This is with the trunk version 29061.
Ralph, any thoughts on where we go from here?

[rolf@Fermi-Cluster src]$ mpirun -np 4 -host 

Re: [OMPI devel] openib BTL problems with ORTE async changes

2013-09-03 Thread Rolf vandeVaart
Correction: That line below should be:
gmake run FILE=p2p_c

From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Rolf vandeVaart
Sent: Tuesday, September 03, 2013 4:50 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] openib BTL problems with ORTE async changes

I just retried and I still get errors with the latest trunk. (29112).  If I 
back up to r29057, then everything is fine.  In addition, I can reproduce this 
on two different clusters.
Can you try running the entire intel test suite and see if that works?  Maybe a 
different test will fail for you.

   cd ompi-tests/trunk/intel_tests/src
gmake run FILE=cuda_c

You need to modify Makefile in intel_tests to make it do the right thing.  
Trying to figure out what I should do next.  As I said, I get a variety of 
different failures.  Maybe I should collect them up and see what it means.  
This failure has me dead in the water with the trunk.



From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Tuesday, September 03, 2013 3:41 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] openib BTL problems with ORTE async changes

Sigh - I cannot get it to fail. I've tried up to np=16 without getting a single 
hiccup.

Try a fresh checkout - let's make sure you don't have some old cruft laying 
around.

On Sep 3, 2013, at 12:26 PM, Rolf vandeVaart wrote:

I am running a debug build.  Here is my configure line:

../configure --enable-debug --enable-shared --disable-static 
--prefix=/home/rolf/ompi-trunk-29061/64 
--with-wrapper-ldflags='-Wl,-rpath,${prefix}/lib' --disable-vt 
--enable-orterun-prefix-by-default -disable-io-romio --enable-picky

The test program is from the intel test suite in our test suite.
http://svn.open-mpi.org/svn/ompi-tests/trunk/intel_tests/src/MPI_Irecv_comm_c.c

Run with at least np=4.  The more np, the better.


From: devel [mailto:devel-boun...@open-mpi.org] On 
Behalf Of Ralph Castain
Sent: Tuesday, September 03, 2013 3:22 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] openib BTL problems with ORTE async changes

Also, send me your test code - maybe that is required to trigger it

On Sep 3, 2013, at 12:19 PM, Ralph Castain wrote:


Dang - I just finished running it on odin without a problem. Are you seeing 
this with a debug or optimized build?


On Sep 3, 2013, at 12:16 PM, Rolf vandeVaart wrote:


Yes, it fails on the current trunk (r29112).  That is what started me on the 
journey to figure out when things went wrong.  It was working up until r29058.

From: devel [mailto:devel-boun...@open-mpi.org] On 
Behalf Of Ralph Castain
Sent: Tuesday, September 03, 2013 2:49 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] openib BTL problems with ORTE async changes

Are you all the way up to the current trunk? There have been a few typo fixes 
since the original commit.

I'm not familiar with the OOB connect code in openib. The OOB itself isn't 
using free list, so I suspect it is something up in the OOB connect code 
itself. I'll take a look and see if something leaps out at me - it seems to be 
working fine on IU's odin cluster, which is the only IB-based system I can 
access


On Sep 3, 2013, at 11:34 AM, Rolf vandeVaart wrote:



As mentioned in the weekly conference call, I am seeing some strange errors 
when using the openib BTL.  I have narrowed down the changeset that broke 
things to the ORTE async code.

https://svn.open-mpi.org/trac/ompi/changeset/29058  (and 
https://svn.open-mpi.org/trac/ompi/changeset/29061 which was needed to fix 
compile errors)

Changeset 29057 does not have these issues.  I do not have a very good 
characterization of the failures.  The failures are not consistent.  Sometimes 
they can pass.  Sometimes the stack trace can be different.  They seem to 
happen more with larger np, like np=4 and more.

The first failure mode is a segmentation violation and it always seems to be 
that we are trying to pop something off a free list.  But the upper parts of the 
stack trace can vary.  This is with the trunk version 29061.
Ralph, any thoughts on where we go from here?

[rolf@Fermi-Cluster src]$ mpirun -np 4 -host c0-0,c0-1,c0-3,c0-4 
MPI_Irecv_comm_c
MPITEST info  (0): Starting:  MPI_Irecv_comm:
[compute-0-4:04752] *** Process received signal *** [compute-0-4:04752] Signal: 
Segmentation fault (11) [compute-0-4:04752] Signal code: Address not mapped (1) 
[compute-0-4:04752] Failing at address: 0x28
--
mpirun noticed that process rank 3 with PID 4752 on node c0-4 exited on signal 
11 (Segmentation fault).

Re: [OMPI devel] openib BTL problems with ORTE async changes

2013-09-03 Thread Rolf vandeVaart
I just retried and I still get errors with the latest trunk. (29112).  If I 
back up to r29057, then everything is fine.  In addition, I can reproduce this 
on two different clusters.
Can you try running the entire intel test suite and see if that works?  Maybe a 
different test will fail for you.

   cd ompi-tests/trunk/intel_tests/src
gmake run FILE=cuda_c

You need to modify Makefile in intel_tests to make it do the right thing.  
Trying to figure out what I should do next.  As I said, I get a variety of 
different failures.  Maybe I should collect them up and see what it means.  
This failure has me dead in the water with the trunk.



From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Tuesday, September 03, 2013 3:41 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] openib BTL problems with ORTE async changes

Sigh - I cannot get it to fail. I've tried up to np=16 without getting a single 
hiccup.

Try a fresh checkout - let's make sure you don't have some old cruft laying 
around.

On Sep 3, 2013, at 12:26 PM, Rolf vandeVaart wrote:


I am running a debug build.  Here is my configure line:

../configure --enable-debug --enable-shared --disable-static 
--prefix=/home/rolf/ompi-trunk-29061/64 
--with-wrapper-ldflags='-Wl,-rpath,${prefix}/lib' --disable-vt 
--enable-orterun-prefix-by-default -disable-io-romio --enable-picky

The test program is from the intel test suite in our test suite.
http://svn.open-mpi.org/svn/ompi-tests/trunk/intel_tests/src/MPI_Irecv_comm_c.c

Run with at least np=4.  The more np, the better.


From: devel [mailto:devel-boun...@open-mpi.org] On 
Behalf Of Ralph Castain
Sent: Tuesday, September 03, 2013 3:22 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] openib BTL problems with ORTE async changes

Also, send me your test code - maybe that is required to trigger it

On Sep 3, 2013, at 12:19 PM, Ralph Castain wrote:



Dang - I just finished running it on odin without a problem. Are you seeing 
this with a debug or optimized build?


On Sep 3, 2013, at 12:16 PM, Rolf vandeVaart wrote:



Yes, it fails on the current trunk (r29112).  That is what started me on the 
journey to figure out when things went wrong.  It was working up until r29058.

From: devel [mailto:devel-boun...@open-mpi.org] On 
Behalf Of Ralph Castain
Sent: Tuesday, September 03, 2013 2:49 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] openib BTL problems with ORTE async changes

Are you all the way up to the current trunk? There have been a few typo fixes 
since the original commit.

I'm not familiar with the OOB connect code in openib. The OOB itself isn't 
using free list, so I suspect it is something up in the OOB connect code 
itself. I'll take a look and see if something leaps out at me - it seems to be 
working fine on IU's odin cluster, which is the only IB-based system I can 
access


On Sep 3, 2013, at 11:34 AM, Rolf vandeVaart wrote:




As mentioned in the weekly conference call, I am seeing some strange errors 
when using the openib BTL.  I have narrowed down the changeset that broke 
things to the ORTE async code.

https://svn.open-mpi.org/trac/ompi/changeset/29058  (and 
https://svn.open-mpi.org/trac/ompi/changeset/29061 which was needed to fix 
compile errors)

Changeset 29057 does not have these issues.  I do not have a very good 
characterization of the failures.  The failures are not consistent.  Sometimes 
they can pass.  Sometimes the stack trace can be different.  They seem to 
happen more with larger np, like np=4 and more.

The first failure mode is a segmentation violation and it always seems to be 
that we are trying to pop something off a free list.  But the upper parts of the 
stack trace can vary.  This is with the trunk version 29061.
Ralph, any thoughts on where we go from here?

[rolf@Fermi-Cluster src]$ mpirun -np 4 -host c0-0,c0-1,c0-3,c0-4 
MPI_Irecv_comm_c
MPITEST info  (0): Starting:  MPI_Irecv_comm:
[compute-0-4:04752] *** Process received signal *** [compute-0-4:04752] Signal: 
Segmentation fault (11) [compute-0-4:04752] Signal code: Address not mapped (1) 
[compute-0-4:04752] Failing at address: 0x28
--
mpirun noticed that process rank 3 with PID 4752 on node c0-4 exited on signal 
11 (Segmentation fault).
--
[rolf@Fermi-Cluster src]$ gdb MPI_Irecv_comm_c core.4752 GNU gdb Fedora 
(6.8-27.el5) Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
Core was generated by 

Re: [OMPI devel] openib BTL problems with ORTE async changes

2013-09-03 Thread Ralph Castain
Sigh - I cannot get it to fail. I've tried up to np=16 without getting a single 
hiccup.

Try a fresh checkout - let's make sure you don't have some old cruft laying 
around.

On Sep 3, 2013, at 12:26 PM, Rolf vandeVaart  wrote:

> I am running a debug build.  Here is my configure line:
>  
> ../configure --enable-debug --enable-shared --disable-static 
> --prefix=/home/rolf/ompi-trunk-29061/64 
> --with-wrapper-ldflags='-Wl,-rpath,${prefix}/lib' --disable-vt 
> --enable-orterun-prefix-by-default -disable-io-romio --enable-picky
>  
> The test program is from the intel test suite in our test suite.
> http://svn.open-mpi.org/svn/ompi-tests/trunk/intel_tests/src/MPI_Irecv_comm_c.c
>  
> Run with at least np=4.  The more np, the better.
>  
>  
> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain
> Sent: Tuesday, September 03, 2013 3:22 PM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] openib BTL problems with ORTE async changes
>  
> Also, send me your test code - maybe that is required to trigger it
>  
> On Sep 3, 2013, at 12:19 PM, Ralph Castain  wrote:
> 
> 
> Dang - I just finished running it on odin without a problem. Are you seeing 
> this with a debug or optimized build?
>  
>  
> On Sep 3, 2013, at 12:16 PM, Rolf vandeVaart  wrote:
> 
> 
> Yes, it fails on the current trunk (r29112).  That is what started me on the 
> journey to figure out when things went wrong.  It was working up until r29058.
>  
> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain
> Sent: Tuesday, September 03, 2013 2:49 PM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] openib BTL problems with ORTE async changes
>  
> Are you all the way up to the current trunk? There have been a few typo fixes 
> since the original commit.
>  
> I'm not familiar with the OOB connect code in openib. The OOB itself isn't 
> using free list, so I suspect it is something up in the OOB connect code 
> itself. I'll take a look and see if something leaps out at me - it seems to 
> be working fine on IU's odin cluster, which is the only IB-based system I can 
> access
>  
>  
> On Sep 3, 2013, at 11:34 AM, Rolf vandeVaart  wrote:
> 
> 
> 
> As mentioned in the weekly conference call, I am seeing some strange errors 
> when using the openib BTL.  I have narrowed down the changeset that broke 
> things to the ORTE async code.
>  
> https://svn.open-mpi.org/trac/ompi/changeset/29058  (and 
> https://svn.open-mpi.org/trac/ompi/changeset/29061 which was needed to fix 
> compile errors)
>  
> Changeset 29057 does not have these issues.  I do not have a very good 
> characterization of the failures.  The failures are not consistent.  
> Sometimes they can pass.  Sometimes the stack trace can be different.  They 
> seem to happen more with larger np, like np=4 and more.   
>  
> The first failure mode is a segmentation violation and it always seems to be 
> that we are trying to pop something of a free list.  But the upper parts of 
> the stack trace can vary.  This is with the trunk version 29061.
> Ralph, any thoughts on where we go from here?
>  
> [rolf@Fermi-Cluster src]$ mpirun -np 4 -host c0-0,c0-1,c0-3,c0-4 
> MPI_Irecv_comm_c
> MPITEST info  (0): Starting:  MPI_Irecv_comm:   
> [compute-0-4:04752] *** Process received signal *** [compute-0-4:04752] 
> Signal: Segmentation fault (11) [compute-0-4:04752] Signal code: Address not 
> mapped (1) [compute-0-4:04752] Failing at address: 0x28
> --
> mpirun noticed that process rank 3 with PID 4752 on node c0-4 exited on 
> signal 11 (Segmentation fault).
> --
> [rolf@Fermi-Cluster src]$ gdb MPI_Irecv_comm_c core.4752 GNU gdb Fedora 
> (6.8-27.el5) Copyright (C) 2008 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later 
> Core was generated by `MPI_Irecv_comm_c'.
> Program terminated with signal 11, Segmentation fault.
> [New process 4753]
> [New process 4756]
> [New process 4755]
> [New process 4754]
> [New process 4752]
> #0  0x2d6ecad6 in opal_atomic_lifo_pop (lifo=0x5996940) at 
> ../../../../../opal/class/opal_atomic_lifo.h:111
> 111 lifo->opal_lifo_head = 
> (opal_list_item_t*)item->opal_list_next;
> (gdb) where
> #0  0x2d6ecad6 in opal_atomic_lifo_pop (lifo=0x5996940) at 
> ../../../../../opal/class/opal_atomic_lifo.h:111
> #1  0x2d6ec5b4 in __ompi_free_list_wait_mt (fl=0x5996940, 
> item=0x40ea8d50) at ../../../../../ompi/class/ompi_free_list.h:228
> #2  0x2d6ec3f8 in post_recvs (ep=0x59f3120, qp=0, num_post=256) at 
> ../../../../../ompi/mca/btl/openib/btl_openib_endpoint.h:361
> #3  0x2d6ec1ae in mca_btl_openib_endpoint_post_rr_nolock 
> (ep=0x59f3120, qp=0)
> at 

Re: [OMPI devel] openib BTL problems with ORTE async changes

2013-09-03 Thread Rolf vandeVaart
I am running a debug build.  Here is my configure line:

../configure --enable-debug --enable-shared --disable-static 
--prefix=/home/rolf/ompi-trunk-29061/64 
--with-wrapper-ldflags='-Wl,-rpath,${prefix}/lib' --disable-vt 
--enable-orterun-prefix-by-default -disable-io-romio --enable-picky

The test program is from the intel test suite in our test suite.
http://svn.open-mpi.org/svn/ompi-tests/trunk/intel_tests/src/MPI_Irecv_comm_c.c

Run with at least np=4.  The more np, the better.


From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Tuesday, September 03, 2013 3:22 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] openib BTL problems with ORTE async changes

Also, send me your test code - maybe that is required to trigger it

On Sep 3, 2013, at 12:19 PM, Ralph Castain wrote:


Dang - I just finished running it on odin without a problem. Are you seeing 
this with a debug or optimized build?


On Sep 3, 2013, at 12:16 PM, Rolf vandeVaart wrote:


Yes, it fails on the current trunk (r29112).  That is what started me on the 
journey to figure out when things went wrong.  It was working up until r29058.

From: devel [mailto:devel-boun...@open-mpi.org] On 
Behalf Of Ralph Castain
Sent: Tuesday, September 03, 2013 2:49 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] openib BTL problems with ORTE async changes

Are you all the way up to the current trunk? There have been a few typo fixes 
since the original commit.

I'm not familiar with the OOB connect code in openib. The OOB itself isn't 
using free list, so I suspect it is something up in the OOB connect code 
itself. I'll take a look and see if something leaps out at me - it seems to be 
working fine on IU's odin cluster, which is the only IB-based system I can 
access


On Sep 3, 2013, at 11:34 AM, Rolf vandeVaart wrote:



As mentioned in the weekly conference call, I am seeing some strange errors 
when using the openib BTL.  I have narrowed down the changeset that broke 
things to the ORTE async code.

https://svn.open-mpi.org/trac/ompi/changeset/29058  (and 
https://svn.open-mpi.org/trac/ompi/changeset/29061 which was needed to fix 
compile errors)

Changeset 29057 does not have these issues.  I do not have a very good 
characterization of the failures.  The failures are not consistent.  Sometimes 
they can pass.  Sometimes the stack trace can be different.  They seem to 
happen more with larger np, like np=4 and more.

The first failure mode is a segmentation violation and it always seems to be 
that we are trying to pop something off a free list.  But the upper parts of the 
stack trace can vary.  This is with the trunk version 29061.
Ralph, any thoughts on where we go from here?

[rolf@Fermi-Cluster src]$ mpirun -np 4 -host c0-0,c0-1,c0-3,c0-4 
MPI_Irecv_comm_c
MPITEST info  (0): Starting:  MPI_Irecv_comm:
[compute-0-4:04752] *** Process received signal *** [compute-0-4:04752] Signal: 
Segmentation fault (11) [compute-0-4:04752] Signal code: Address not mapped (1) 
[compute-0-4:04752] Failing at address: 0x28
--
mpirun noticed that process rank 3 with PID 4752 on node c0-4 exited on signal 
11 (Segmentation fault).
--
[rolf@Fermi-Cluster src]$ gdb MPI_Irecv_comm_c core.4752 GNU gdb Fedora 
(6.8-27.el5) Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
Core was generated by `MPI_Irecv_comm_c'.
Program terminated with signal 11, Segmentation fault.
[New process 4753]
[New process 4756]
[New process 4755]
[New process 4754]
[New process 4752]
#0  0x2d6ecad6 in opal_atomic_lifo_pop (lifo=0x5996940) at 
../../../../../opal/class/opal_atomic_lifo.h:111
111 lifo->opal_lifo_head = (opal_list_item_t*)item->opal_list_next;
(gdb) where
#0  0x2d6ecad6 in opal_atomic_lifo_pop (lifo=0x5996940) at 
../../../../../opal/class/opal_atomic_lifo.h:111
#1  0x2d6ec5b4 in __ompi_free_list_wait_mt (fl=0x5996940, 
item=0x40ea8d50) at ../../../../../ompi/class/ompi_free_list.h:228
#2  0x2d6ec3f8 in post_recvs (ep=0x59f3120, qp=0, num_post=256) at 
../../../../../ompi/mca/btl/openib/btl_openib_endpoint.h:361
#3  0x2d6ec1ae in mca_btl_openib_endpoint_post_rr_nolock (ep=0x59f3120, 
qp=0)
at ../../../../../ompi/mca/btl/openib/btl_openib_endpoint.h:405
#4  0x2d6ebfad in mca_btl_openib_endpoint_post_recvs 
(endpoint=0x59f3120)
at ../../../../../ompi/mca/btl/openib/btl_openib_endpoint.c:494
#5  0x2d6fe71c in qp_create_all (endpoint=0x59f3120) at 

Re: [OMPI devel] GNU Automake 1.14 released

2013-09-03 Thread Jeff Squyres (jsquyres)
How about sym linking the source file?  Then you would only need a single 
Makefile.am; you can use different flags depending on which source file you 
compile.

While somewhat gross, it's not totally disgusting, and it should work to the 
same effect...?
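
Roughly like this, say (file and flag names invented for illustration; the 
symlink itself would be created by autogen.sh or similar rather than checked 
in):

# Makefile.am sketch: foo.c is the real source, foo_wrapped.c is a
# symlink to it (e.g. "ln -s foo.c foo_wrapped.c"), so the same code
# gets compiled twice with different per-target flags.
bin_PROGRAMS = foo foo-wrapped

foo_SOURCES  = foo.c
foo_CPPFLAGS = $(AM_CPPFLAGS)

foo_wrapped_SOURCES  = foo_wrapped.c
foo_wrapped_CPPFLAGS = $(AM_CPPFLAGS) -DWRAPPED_BUILD=1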


On Aug 30, 2013, at 4:16 AM, Bert Wesarg  wrote:

> Hi,
> 
> On Fri, Jun 21, 2013 at 2:01 PM, Stefano Lattarini
>  wrote:
>> We are pleased to announce the GNU Automake 1.14 minor release.
>> 
>> 
>>  - The next major Automake version (2.0) will unconditionally activate
>>the 'subdir-objects' option.  In order to smooth out the transition,
>>we now give a warning (in the category 'unsupported') whenever a
>>source file is present in a subdirectory but the 'subdir-objects' option is
>>not enabled.  For example, the following usage will trigger such a
>>warning:
>> 
>>bin_PROGRAMS = sub/foo
>>sub_foo_SOURCES = sub/main.c sub/bar.c
>> 
> 
> we don't understand how this warning should 'smooth' the transition to
> post-1.14 in our project.
> 
> Here is our situation:
> 
> We have a source file which needs to be compiled twice. But with
> different compilers. Thus we can't use per-target flags and we use two
> separate Makefile.am files for this. Because the compilation rules are
> nearly identical, we use a Makefile.common.inc.am file which will be
> included by both Makefile.am's. Here is the directory layout (the
> complete reduced testcase is attached):
> 
> src/foo.c
> src/Makefile.am
> src/Makefile.common.inc.am
> src/second/Makefile.am
> 
> The src/Makefile.am looks like:
> 
>  8< src/Makefile.am 8< ---
> SUBDIRS = second
> 
> MY_SRCDIR=.
> include Makefile.common.inc.am
> 
> bin_PROGRAMS=foo
> foo_SOURCES=$(FOO_COMMONSOURCES)
>  >8 src/Makefile.am >8 ---
> 
>  8< src/second/Makefile.am 8< ---
> CC=$(top_srcdir)/bin/wrapper
> 
> MY_SRCDIR=..
> include ../Makefile.common.inc.am
> 
> bin_PROGRAMS=foo-wrapped
> foo_wrapped_SOURCES=$(FOO_COMMONSOURCES)
>  >8 src/second/Makefile.am >8 ---
> 
>  8< src/Makefile.common.inc.am 8< ---
> FOO_COMMONSOURCES = $(MY_SRCDIR)/foo.c
>  >8 src/Makefile.common.inc.am >8 ---
> 
> This works with automake 1.13.4 as expected. Now, with automake 1.14
> we get the newly introduced warning mentioned above in the release
> statements. Now enabling subdir-objects is not yet an option for us,
> because we use variables in the _SOURCES list and bug 13928 [1] hits
> us.
> 
> So what would be the best transition in this situation? We don't want
> to remove the Makefile.common.inc.am to avoid the resulting redundancy
> in the two Makefile.am files. We also can't use the newly introduced
> %reldir%, because it also throws the warning, and we also want to
> maintain compatibility with pre-1.14 automake.
> 
> Any guidance is more than welcomed.
> 
> Kind Regards,
> Matthias Jurenz & Bert Wesarg
> 
> [1] http://debbugs.gnu.org/cgi/bugreport.cgi?bug=13928


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] openib BTL problems with ORTE async changes

2013-09-03 Thread Ralph Castain
Also, send me your test code - maybe that is required to trigger it

On Sep 3, 2013, at 12:19 PM, Ralph Castain  wrote:

> Dang - I just finished running it on odin without a problem. Are you seeing 
> this with a debug or optimized build?
> 
> 
> On Sep 3, 2013, at 12:16 PM, Rolf vandeVaart  wrote:
> 
>> Yes, it fails on the current trunk (r29112).  That is what started me on the 
>> journey to figure out when things went wrong.  It was working up until 
>> r29058.
>>  
>> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain
>> Sent: Tuesday, September 03, 2013 2:49 PM
>> To: Open MPI Developers
>> Subject: Re: [OMPI devel] openib BTL problems with ORTE async changes
>>  
>> Are you all the way up to the current trunk? There have been a few typo 
>> fixes since the original commit.
>>  
>> I'm not familiar with the OOB connect code in openib. The OOB itself isn't 
>> using free list, so I suspect it is something up in the OOB connect code 
>> itself. I'll take a look and see if something leaps out at me - it seems to 
>> be working fine on IU's odin cluster, which is the only IB-based system I 
>> can access
>>  
>>  
>> On Sep 3, 2013, at 11:34 AM, Rolf vandeVaart  wrote:
>> 
>> 
>> As mentioned in the weekly conference call, I am seeing some strange errors 
>> when using the openib BTL.  I have narrowed down the changeset that broke 
>> things to the ORTE async code.
>>  
>> https://svn.open-mpi.org/trac/ompi/changeset/29058  (and 
>> https://svn.open-mpi.org/trac/ompi/changeset/29061 which was needed to fix 
>> compile errors)
>>  
>> Changeset 29057 does not have these issues.  I do not have a very good 
>> characterization of the failures.  The failures are not consistent.  
>> Sometimes they can pass.  Sometimes the stack trace can be different.  They 
>> seem to happen more with larger np, like np=4 and more.   
>>  
>> The first failure mode is a segmentation violation and it always seems to be 
>> that we are trying to pop something of a free list.  But the upper parts of 
>> the stack trace can vary.  This is with the trunk version 29061.
>> Ralph, any thoughts on where we go from here?
>>  
>> [rolf@Fermi-Cluster src]$ mpirun -np 4 -host c0-0,c0-1,c0-3,c0-4 
>> MPI_Irecv_comm_c
>> MPITEST info  (0): Starting:  MPI_Irecv_comm:   
>> [compute-0-4:04752] *** Process received signal *** [compute-0-4:04752] 
>> Signal: Segmentation fault (11) [compute-0-4:04752] Signal code: Address not 
>> mapped (1) [compute-0-4:04752] Failing at address: 0x28
>> --
>> mpirun noticed that process rank 3 with PID 4752 on node c0-4 exited on 
>> signal 11 (Segmentation fault).
>> --
>> [rolf@Fermi-Cluster src]$ gdb MPI_Irecv_comm_c core.4752 GNU gdb Fedora 
>> (6.8-27.el5) Copyright (C) 2008 Free Software Foundation, Inc.
>> License GPLv3+: GNU GPL version 3 or later 
>> Core was generated by `MPI_Irecv_comm_c'.
>> Program terminated with signal 11, Segmentation fault.
>> [New process 4753]
>> [New process 4756]
>> [New process 4755]
>> [New process 4754]
>> [New process 4752]
>> #0  0x2d6ecad6 in opal_atomic_lifo_pop (lifo=0x5996940) at 
>> ../../../../../opal/class/opal_atomic_lifo.h:111
>> 111 lifo->opal_lifo_head = 
>> (opal_list_item_t*)item->opal_list_next;
>> (gdb) where
>> #0  0x2d6ecad6 in opal_atomic_lifo_pop (lifo=0x5996940) at 
>> ../../../../../opal/class/opal_atomic_lifo.h:111
>> #1  0x2d6ec5b4 in __ompi_free_list_wait_mt (fl=0x5996940, 
>> item=0x40ea8d50) at ../../../../../ompi/class/ompi_free_list.h:228
>> #2  0x2d6ec3f8 in post_recvs (ep=0x59f3120, qp=0, num_post=256) at 
>> ../../../../../ompi/mca/btl/openib/btl_openib_endpoint.h:361
>> #3  0x2d6ec1ae in mca_btl_openib_endpoint_post_rr_nolock 
>> (ep=0x59f3120, qp=0)
>> at ../../../../../ompi/mca/btl/openib/btl_openib_endpoint.h:405
>> #4  0x2d6ebfad in mca_btl_openib_endpoint_post_recvs 
>> (endpoint=0x59f3120)
>> at ../../../../../ompi/mca/btl/openib/btl_openib_endpoint.c:494
>> #5  0x2d6fe71c in qp_create_all (endpoint=0x59f3120) at 
>> ../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:432
>> #6  0x2d6fde2b in reply_start_connect (endpoint=0x59f3120, 
>> rem_info=0x40ea8ed0)
>> at 
>> ../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:245
>> #7  0x2d7006ae in rml_recv_cb (status=0, process_name=0x5b0bb90, 
>> buffer=0x40ea8f80, tag=102, cbdata=0x0)
>> at 
>> ../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:858
>> #8  0x2ae802454601 in orte_rml_base_process_msg (fd=-1, flags=4, 
>> cbdata=0x5b0bac0)
>> at ../../../../orte/mca/rml/base/rml_base_msg_handlers.c:172
>> #9  0x2ae8027164a1 in event_process_active_single_queue 

Re: [OMPI devel] openib BTL problems with ORTE async changes

2013-09-03 Thread Ralph Castain
Dang - I just finished running it on odin without a problem. Are you seeing 
this with a debug or optimized build?


On Sep 3, 2013, at 12:16 PM, Rolf vandeVaart  wrote:

> Yes, it fails on the current trunk (r29112).  That is what started me on the 
> journey to figure out when things went wrong.  It was working up until r29058.
>  
> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain
> Sent: Tuesday, September 03, 2013 2:49 PM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] openib BTL problems with ORTE async changes
>  
> Are you all the way up to the current trunk? There have been a few typo fixes 
> since the original commit.
>  
> I'm not familiar with the OOB connect code in openib. The OOB itself isn't 
> using free list, so I suspect it is something up in the OOB connect code 
> itself. I'll take a look and see if something leaps out at me - it seems to 
> be working fine on IU's odin cluster, which is the only IB-based system I can 
> access
>  
>  
> On Sep 3, 2013, at 11:34 AM, Rolf vandeVaart  wrote:
> 
> 
> As mentioned in the weekly conference call, I am seeing some strange errors 
> when using the openib BTL.  I have narrowed down the changeset that broke 
> things to the ORTE async code.
>  
> https://svn.open-mpi.org/trac/ompi/changeset/29058  (and 
> https://svn.open-mpi.org/trac/ompi/changeset/29061 which was needed to fix 
> compile errors)
>  
> Changeset 29057 does not have these issues.  I do not have a very good 
> characterization of the failures.  The failures are not consistent.  
> Sometimes they can pass.  Sometimes the stack trace can be different.  They 
> seem to happen more with larger np, like np=4 and more.   
>  
> The first failure mode is a segmentation violation and it always seems to be 
> that we are trying to pop something of a free list.  But the upper parts of 
> the stack trace can vary.  This is with the trunk version 29061.
> Ralph, any thoughts on where we go from here?
>  
> [rolf@Fermi-Cluster src]$ mpirun -np 4 -host c0-0,c0-1,c0-3,c0-4 
> MPI_Irecv_comm_c
> MPITEST info  (0): Starting:  MPI_Irecv_comm:   
> [compute-0-4:04752] *** Process received signal *** [compute-0-4:04752] 
> Signal: Segmentation fault (11) [compute-0-4:04752] Signal code: Address not 
> mapped (1) [compute-0-4:04752] Failing at address: 0x28
> --
> mpirun noticed that process rank 3 with PID 4752 on node c0-4 exited on 
> signal 11 (Segmentation fault).
> --
> [rolf@Fermi-Cluster src]$ gdb MPI_Irecv_comm_c core.4752 GNU gdb Fedora 
> (6.8-27.el5) Copyright (C) 2008 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later 
> Core was generated by `MPI_Irecv_comm_c'.
> Program terminated with signal 11, Segmentation fault.
> [New process 4753]
> [New process 4756]
> [New process 4755]
> [New process 4754]
> [New process 4752]
> #0  0x2d6ecad6 in opal_atomic_lifo_pop (lifo=0x5996940) at 
> ../../../../../opal/class/opal_atomic_lifo.h:111
> 111 lifo->opal_lifo_head = 
> (opal_list_item_t*)item->opal_list_next;
> (gdb) where
> #0  0x2d6ecad6 in opal_atomic_lifo_pop (lifo=0x5996940) at 
> ../../../../../opal/class/opal_atomic_lifo.h:111
> #1  0x2d6ec5b4 in __ompi_free_list_wait_mt (fl=0x5996940, 
> item=0x40ea8d50) at ../../../../../ompi/class/ompi_free_list.h:228
> #2  0x2d6ec3f8 in post_recvs (ep=0x59f3120, qp=0, num_post=256) at 
> ../../../../../ompi/mca/btl/openib/btl_openib_endpoint.h:361
> #3  0x2d6ec1ae in mca_btl_openib_endpoint_post_rr_nolock 
> (ep=0x59f3120, qp=0)
> at ../../../../../ompi/mca/btl/openib/btl_openib_endpoint.h:405
> #4  0x2d6ebfad in mca_btl_openib_endpoint_post_recvs 
> (endpoint=0x59f3120)
> at ../../../../../ompi/mca/btl/openib/btl_openib_endpoint.c:494
> #5  0x2d6fe71c in qp_create_all (endpoint=0x59f3120) at 
> ../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:432
> #6  0x2d6fde2b in reply_start_connect (endpoint=0x59f3120, 
> rem_info=0x40ea8ed0)
> at ../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:245
> #7  0x2d7006ae in rml_recv_cb (status=0, process_name=0x5b0bb90, 
> buffer=0x40ea8f80, tag=102, cbdata=0x0)
> at ../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:858
> #8  0x2ae802454601 in orte_rml_base_process_msg (fd=-1, flags=4, 
> cbdata=0x5b0bac0)
> at ../../../../orte/mca/rml/base/rml_base_msg_handlers.c:172
> #9  0x2ae8027164a1 in event_process_active_single_queue (base=0x58ac620, 
> activeq=0x58aa5b0)
> at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1367
> #10 0x2ae802716b24 in event_process_active (base=0x58ac620) at 
> 

Re: [OMPI devel] openib BTL problems with ORTE async changes

2013-09-03 Thread Rolf vandeVaart
Yes, it fails on the current trunk (r29112).  That is what started me on the 
journey to figure out when things went wrong.  It was working up until r29058.

From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Tuesday, September 03, 2013 2:49 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] openib BTL problems with ORTE async changes

Are you all the way up to the current trunk? There have been a few typo fixes 
since the original commit.

I'm not familiar with the OOB connect code in openib. The OOB itself isn't 
using free list, so I suspect it is something up in the OOB connect code 
itself. I'll take a look and see if something leaps out at me - it seems to be 
working fine on IU's odin cluster, which is the only IB-based system I can 
access


On Sep 3, 2013, at 11:34 AM, Rolf vandeVaart wrote:


As mentioned in the weekly conference call, I am seeing some strange errors 
when using the openib BTL.  I have narrowed down the changeset that broke 
things to the ORTE async code.

https://svn.open-mpi.org/trac/ompi/changeset/29058  (and 
https://svn.open-mpi.org/trac/ompi/changeset/29061 which was needed to fix 
compile errors)

Changeset 29057 does not have these issues.  I do not have a very good 
characterization of the failures.  The failures are not consistent.  Sometimes 
they can pass.  Sometimes the stack trace can be different.  They seem to 
happen more with larger np, like np=4 and more.

The first failure mode is a segmentation violation and it always seems to be 
that we are trying to pop something off a free list.  But the upper parts of the 
stack trace can vary.  This is with the trunk version 29061.
Ralph, any thoughts on where we go from here?

[rolf@Fermi-Cluster src]$ mpirun -np 4 -host c0-0,c0-1,c0-3,c0-4 
MPI_Irecv_comm_c
MPITEST info  (0): Starting:  MPI_Irecv_comm:
[compute-0-4:04752] *** Process received signal *** [compute-0-4:04752] Signal: 
Segmentation fault (11) [compute-0-4:04752] Signal code: Address not mapped (1) 
[compute-0-4:04752] Failing at address: 0x28
--
mpirun noticed that process rank 3 with PID 4752 on node c0-4 exited on signal 
11 (Segmentation fault).
--
[rolf@Fermi-Cluster src]$ gdb MPI_Irecv_comm_c core.4752 GNU gdb Fedora 
(6.8-27.el5) Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
Core was generated by `MPI_Irecv_comm_c'.
Program terminated with signal 11, Segmentation fault.
[New process 4753]
[New process 4756]
[New process 4755]
[New process 4754]
[New process 4752]
#0  0x2d6ecad6 in opal_atomic_lifo_pop (lifo=0x5996940) at 
../../../../../opal/class/opal_atomic_lifo.h:111
111 lifo->opal_lifo_head = (opal_list_item_t*)item->opal_list_next;
(gdb) where
#0  0x2d6ecad6 in opal_atomic_lifo_pop (lifo=0x5996940) at 
../../../../../opal/class/opal_atomic_lifo.h:111
#1  0x2d6ec5b4 in __ompi_free_list_wait_mt (fl=0x5996940, 
item=0x40ea8d50) at ../../../../../ompi/class/ompi_free_list.h:228
#2  0x2d6ec3f8 in post_recvs (ep=0x59f3120, qp=0, num_post=256) at 
../../../../../ompi/mca/btl/openib/btl_openib_endpoint.h:361
#3  0x2d6ec1ae in mca_btl_openib_endpoint_post_rr_nolock (ep=0x59f3120, 
qp=0)
at ../../../../../ompi/mca/btl/openib/btl_openib_endpoint.h:405
#4  0x2d6ebfad in mca_btl_openib_endpoint_post_recvs 
(endpoint=0x59f3120)
at ../../../../../ompi/mca/btl/openib/btl_openib_endpoint.c:494
#5  0x2d6fe71c in qp_create_all (endpoint=0x59f3120) at 
../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:432
#6  0x2d6fde2b in reply_start_connect (endpoint=0x59f3120, 
rem_info=0x40ea8ed0)
at ../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:245
#7  0x2d7006ae in rml_recv_cb (status=0, process_name=0x5b0bb90, 
buffer=0x40ea8f80, tag=102, cbdata=0x0)
at ../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:858
#8  0x2ae802454601 in orte_rml_base_process_msg (fd=-1, flags=4, 
cbdata=0x5b0bac0)
at ../../../../orte/mca/rml/base/rml_base_msg_handlers.c:172
#9  0x2ae8027164a1 in event_process_active_single_queue (base=0x58ac620, 
activeq=0x58aa5b0)
at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1367
#10 0x2ae802716b24 in event_process_active (base=0x58ac620) at 
../../../../../../opal/mca/event/libevent2021/libevent/event.c:1437
#11 0x2ae80271715c in opal_libevent2021_event_base_loop (base=0x58ac620, 
flags=1)
at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1645
#12 0x2ae8023e7465 in orte_progress_thread_engine (obj=0x2ae8026902c0) at 
../../orte/runtime/orte_init.c:180
#13 0x003ab1e06367 in start_thread () from /lib64/libpthread.so.0
#14 

Re: [OMPI devel] NO LT_DLADVISE - CANNOT LOAD LIBOMPI JAVA BINDINGS

2013-09-03 Thread Jeff Squyres (jsquyres)
On Sep 2, 2013, at 1:53 AM, Bibrak Qamar  wrote:

> Yes you are right, it does distribute the ltdl in the source library. But 
> isn't it installed by default when OpenMPI is installed?

It certainly should.  But it's part of libopen-pal.so -- not a standalone 
libltdl.so.

If you're running your own copies of the autotools (vs. just using a 
bootstrapped tarball), you probably need to ensure that you have a recent 
enough version of the autotools (libtool, in particular).

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] openib BTL problems with ORTE async changes

2013-09-03 Thread Ralph Castain
Are you all the way up to the current trunk? There have been a few typo fixes 
since the original commit.

I'm not familiar with the OOB connect code in openib. The OOB itself isn't 
using a free list, so I suspect it is something in the OOB connect code 
itself. I'll take a look and see if something leaps out at me - it seems to be 
working fine on IU's odin cluster, which is the only IB-based system I can 
access.


On Sep 3, 2013, at 11:34 AM, Rolf vandeVaart  wrote:

> As mentioned in the weekly conference call, I am seeing some strange errors 
> when using the openib BTL.  I have narrowed down the changeset that broke 
> things to the ORTE async code.
>  
> https://svn.open-mpi.org/trac/ompi/changeset/29058  (and 
> https://svn.open-mpi.org/trac/ompi/changeset/29061 which was needed to fix 
> compile errors)
>  
> Changeset 29057 does not have these issues.  I do not have a very good 
> characterization of the failures.  The failures are not consistent.  
> Sometimes they can pass.  Sometimes the stack trace can be different.  They 
> seem to happen more with larger np, like np=4 and more.   
>  
> The first failure mode is a segmentation violation and it always seems to be 
> that we are trying to pop something of a free list.  But the upper parts of 
> the stack trace can vary.  This is with the trunk version 29061.
> Ralph, any thoughts on where we go from here?
>  
> [rolf@Fermi-Cluster src]$ mpirun -np 4 -host c0-0,c0-1,c0-3,c0-4 
> MPI_Irecv_comm_c
> MPITEST info  (0): Starting:  MPI_Irecv_comm:   
> [compute-0-4:04752] *** Process received signal *** [compute-0-4:04752] 
> Signal: Segmentation fault (11) [compute-0-4:04752] Signal code: Address not 
> mapped (1) [compute-0-4:04752] Failing at address: 0x28
> --
> mpirun noticed that process rank 3 with PID 4752 on node c0-4 exited on 
> signal 11 (Segmentation fault).
> --
> [rolf@Fermi-Cluster src]$ gdb MPI_Irecv_comm_c core.4752 GNU gdb Fedora 
> (6.8-27.el5) Copyright (C) 2008 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later 
> Core was generated by `MPI_Irecv_comm_c'.
> Program terminated with signal 11, Segmentation fault.
> [New process 4753]
> [New process 4756]
> [New process 4755]
> [New process 4754]
> [New process 4752]
> #0  0x2d6ecad6 in opal_atomic_lifo_pop (lifo=0x5996940) at 
> ../../../../../opal/class/opal_atomic_lifo.h:111
> 111 lifo->opal_lifo_head = 
> (opal_list_item_t*)item->opal_list_next;
> (gdb) where
> #0  0x2d6ecad6 in opal_atomic_lifo_pop (lifo=0x5996940) at 
> ../../../../../opal/class/opal_atomic_lifo.h:111
> #1  0x2d6ec5b4 in __ompi_free_list_wait_mt (fl=0x5996940, 
> item=0x40ea8d50) at ../../../../../ompi/class/ompi_free_list.h:228
> #2  0x2d6ec3f8 in post_recvs (ep=0x59f3120, qp=0, num_post=256) at 
> ../../../../../ompi/mca/btl/openib/btl_openib_endpoint.h:361
> #3  0x2d6ec1ae in mca_btl_openib_endpoint_post_rr_nolock 
> (ep=0x59f3120, qp=0)
> at ../../../../../ompi/mca/btl/openib/btl_openib_endpoint.h:405
> #4  0x2d6ebfad in mca_btl_openib_endpoint_post_recvs 
> (endpoint=0x59f3120)
> at ../../../../../ompi/mca/btl/openib/btl_openib_endpoint.c:494
> #5  0x2d6fe71c in qp_create_all (endpoint=0x59f3120) at 
> ../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:432
> #6  0x2d6fde2b in reply_start_connect (endpoint=0x59f3120, 
> rem_info=0x40ea8ed0)
> at ../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:245
> #7  0x2d7006ae in rml_recv_cb (status=0, process_name=0x5b0bb90, 
> buffer=0x40ea8f80, tag=102, cbdata=0x0)
> at ../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:858
> #8  0x2ae802454601 in orte_rml_base_process_msg (fd=-1, flags=4, 
> cbdata=0x5b0bac0)
> at ../../../../orte/mca/rml/base/rml_base_msg_handlers.c:172
> #9  0x2ae8027164a1 in event_process_active_single_queue (base=0x58ac620, 
> activeq=0x58aa5b0)
> at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1367
> #10 0x2ae802716b24 in event_process_active (base=0x58ac620) at 
> ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1437
> #11 0x2ae80271715c in opal_libevent2021_event_base_loop (base=0x58ac620, 
> flags=1)
> at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1645
> #12 0x2ae8023e7465 in orte_progress_thread_engine (obj=0x2ae8026902c0) at 
> ../../orte/runtime/orte_init.c:180
> #13 0x003ab1e06367 in start_thread () from /lib64/libpthread.so.0
> #14 0x003ab16d2f7d in clone () from /lib64/libc.so.6
> (gdb)
>  

Re: [OMPI devel] Possible OMPI 1.6.5 bug? SEGV in malloc.c

2013-09-03 Thread Jeff Squyres (jsquyres)
Hmm.  Are you building Open MPI in a special way?  I ask because I'm unable to 
replicate the issue -- I've run your test (and a C equivalent) a few hundred 
times now:


[jsquyres@savbu-usnic-a mpi]$ which gfortran
/usr/bin/gfortran
[jsquyres@savbu-usnic-a mpi]$ gfortran --version
GNU Fortran (GCC) 4.4.6 20110731 (Red Hat 4.4.6-3)
Copyright (C) 2010 Free Software Foundation, Inc.

GNU Fortran comes with NO WARRANTY, to the extent permitted by law.
You may redistribute copies of GNU Fortran
under the terms of the GNU General Public License.
For more information about these matters, see the file named COPYING

[jsquyres@savbu-usnic-a mpi]$ mpifort gnumyhello_f90.f90 -o gnumyhello_f90
[jsquyres@savbu-usnic-a mpi]$ mpicc gnumyhello.c -o gnumyhello
[jsquyres@savbu-usnic-a mpi]$ ulimit -v 1048576
[jsquyres@savbu-usnic-a mpi]$ ./gnumyhello
Hello, world, I am 0 of 1
Failed to allocate
[jsquyres@savbu-usnic-a mpi]$ ./gnumyhello_f90
 Hello, world, I am0  of1
 Task0  failed to allocate7.9721212387084961  GB
[jsquyres@savbu-usnic-a mpi]$
-

No segvs, no core files, etc.
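
For context, the C equivalent is essentially of this shape (a reconstruction 
from the output above, not the exact test source):

/* gnumyhello.c, approximately -- reconstructed from the output above,
 * not the exact test source.  Says hello, then tries one large malloc
 * (the Fortran version reports ~8 GB) and prints whether it worked. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size;
    size_t nbytes = (size_t)8 * 1024 * 1024 * 1024;   /* ~8 GB */
    char *work;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("Hello, world, I am %d of %d\n", rank, size);

    work = malloc(nbytes);
    if (work == NULL) {
        printf("Failed to allocate\n");
    } else {
        free(work);
    }

    MPI_Finalize();
    return 0;
}

The point of running it under "ulimit -v 1048576" is that the allocation 
should fail cleanly with a NULL return; the reported bug is that in Chris's 
builds the same failure sometimes turns into a SEGV inside the ptmalloc2 code 
in libopen-pal instead.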


On Sep 2, 2013, at 2:51 AM, Christopher Samuel  wrote:

> 
> On 02/09/13 16:32, Christopher Samuel wrote:
> 
>> I cannot duplicate this under valgrind or gdb, and given that it doesn't
>> happen every time I run it and that gdb indicates there are at least 2
>> threads running, we're wondering if this is a race condition.
> 
> I have also duplicated this problem with 1.7.3a1r29103.
> 
> Hello, world, I am0  of1
> [barcoo:03306] *** Process received signal ***
> [barcoo:03306] Signal: Segmentation fault (11)
> [barcoo:03306] Signal code: Address not mapped (1)
> [barcoo:03306] Failing at address: 0x2009b4298
> [barcoo:03306] [ 0] /lib64/libpthread.so.0() [0x3f7b60f500]
> [barcoo:03306] [ 1] 
> /usr/local/openmpi/1.7.3a1r29103/lib/libopen-pal.so.5(opal_memory_ptmalloc2_int_malloc+0x96a)
>  [0x7f47de6935aa]
> [barcoo:03306] [ 2] 
> /usr/local/openmpi/1.7.3a1r29103/lib/libopen-pal.so.5(opal_memory_ptmalloc2_malloc+0x52)
>  [0x7f47de694612]
> [barcoo:03306] [ 3] ./1.7-gnumyhello_f90() [0x400dca]
> [barcoo:03306] [ 4] ./1.7-gnumyhello_f90() [0x40104a]
> [barcoo:03306] [ 5] /lib64/libc.so.6(__libc_start_main+0xfd) [0x3f7b21ecdd]
> [barcoo:03306] [ 6] ./1.7-gnumyhello_f90() [0x400bc9]
> [barcoo:03306] *** End of error message ***
> 
> The backtrace I get from the core file isn't as useful though:
> 
> (gdb) bt full
> #0  0x7fd9c4c255aa in opal_memory_ptmalloc2_int_malloc () from 
> /usr/local/openmpi/1.7.3a1r29103/lib/libopen-pal.so.5
> No symbol table info available.
> #1  0x7fd9c4c26612 in opal_memory_ptmalloc2_malloc () from 
> /usr/local/openmpi/1.7.3a1r29103/lib/libopen-pal.so.5
> No symbol table info available.
> #2  0x00400dca in main () at gnumyhello_f90.f90:26
>ierr = 0
>rank = 0
>size = 1
>work = 
> #3  0x0040104a in main ()
> No symbol table info available.
> 
> OMPI 1.7 is built with exactly the same configure options as 1.6
> and the executable is built with -g -O0.
> 
> cheers,
> Chris
> -- 
> Christopher Samuel        Senior Systems Administrator
> VLSCI - Victorian Life Sciences Computation Initiative
> Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
> http://www.vlsci.org.au/  http://twitter.com/vlsci
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



[OMPI devel] openib BTL problems with ORTE async changes

2013-09-03 Thread Rolf vandeVaart
As mentioned in the weekly conference call, I am seeing some strange errors 
when using the openib BTL.  I have narrowed down the changeset that broke 
things to the ORTE async code.

https://svn.open-mpi.org/trac/ompi/changeset/29058  (and 
https://svn.open-mpi.org/trac/ompi/changeset/29061 which was needed to fix 
compile errors)

Changeset 29057 does not have these issues.  I do not have a very good 
characterization of the failures.  The failures are not consistent.  Sometimes 
they can pass.  Sometimes the stack trace can be different.  They seem to 
happen more with larger np, like np=4 and more.

The first failure mode is a segmentation violation and it always seems to be 
that we are trying to pop something off a free list.  But the upper parts of the 
stack trace can vary.  This is with the trunk version 29061.
Ralph, any thoughts on where we go from here?

[rolf@Fermi-Cluster src]$ mpirun -np 4 -host c0-0,c0-1,c0-3,c0-4 
MPI_Irecv_comm_c
MPITEST info  (0): Starting:  MPI_Irecv_comm:
[compute-0-4:04752] *** Process received signal *** [compute-0-4:04752] Signal: 
Segmentation fault (11) [compute-0-4:04752] Signal code: Address not mapped (1) 
[compute-0-4:04752] Failing at address: 0x28
--
mpirun noticed that process rank 3 with PID 4752 on node c0-4 exited on signal 
11 (Segmentation fault).
--
[rolf@Fermi-Cluster src]$ gdb MPI_Irecv_comm_c core.4752 GNU gdb Fedora 
(6.8-27.el5) Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
Core was generated by `MPI_Irecv_comm_c'.
Program terminated with signal 11, Segmentation fault.
[New process 4753]
[New process 4756]
[New process 4755]
[New process 4754]
[New process 4752]
#0  0x2d6ecad6 in opal_atomic_lifo_pop (lifo=0x5996940) at 
../../../../../opal/class/opal_atomic_lifo.h:111
111 lifo->opal_lifo_head = (opal_list_item_t*)item->opal_list_next;
(gdb) where
#0  0x2d6ecad6 in opal_atomic_lifo_pop (lifo=0x5996940) at 
../../../../../opal/class/opal_atomic_lifo.h:111
#1  0x2d6ec5b4 in __ompi_free_list_wait_mt (fl=0x5996940, 
item=0x40ea8d50) at ../../../../../ompi/class/ompi_free_list.h:228
#2  0x2d6ec3f8 in post_recvs (ep=0x59f3120, qp=0, num_post=256) at 
../../../../../ompi/mca/btl/openib/btl_openib_endpoint.h:361
#3  0x2d6ec1ae in mca_btl_openib_endpoint_post_rr_nolock (ep=0x59f3120, 
qp=0)
at ../../../../../ompi/mca/btl/openib/btl_openib_endpoint.h:405
#4  0x2d6ebfad in mca_btl_openib_endpoint_post_recvs 
(endpoint=0x59f3120)
at ../../../../../ompi/mca/btl/openib/btl_openib_endpoint.c:494
#5  0x2d6fe71c in qp_create_all (endpoint=0x59f3120) at 
../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:432
#6  0x2d6fde2b in reply_start_connect (endpoint=0x59f3120, 
rem_info=0x40ea8ed0)
at ../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:245
#7  0x2d7006ae in rml_recv_cb (status=0, process_name=0x5b0bb90, 
buffer=0x40ea8f80, tag=102, cbdata=0x0)
at ../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:858
#8  0x2ae802454601 in orte_rml_base_process_msg (fd=-1, flags=4, 
cbdata=0x5b0bac0)
at ../../../../orte/mca/rml/base/rml_base_msg_handlers.c:172
#9  0x2ae8027164a1 in event_process_active_single_queue (base=0x58ac620, 
activeq=0x58aa5b0)
at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1367
#10 0x2ae802716b24 in event_process_active (base=0x58ac620) at 
../../../../../../opal/mca/event/libevent2021/libevent/event.c:1437
#11 0x2ae80271715c in opal_libevent2021_event_base_loop (base=0x58ac620, 
flags=1)
at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1645
#12 0x2ae8023e7465 in orte_progress_thread_engine (obj=0x2ae8026902c0) at 
../../orte/runtime/orte_init.c:180
#13 0x003ab1e06367 in start_thread () from /lib64/libpthread.so.0
#14 0x003ab16d2f7d in clone () from /lib64/libc.so.6
(gdb)
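
The faulting statement is dereferencing the item just taken off the head of 
the free list, and a fault at a tiny address like 0x28 usually means that item 
was NULL or already recycled. As a deliberately simplified (and deliberately 
racy) sketch - not the actual opal_atomic_lifo code - this is the shape of bug 
an unlocked pop runs into when a second thread pops concurrently:

/* Illustrative only -- NOT the opal implementation.  A naive unlocked
 * LIFO pop: two threads can both read the same head; the one that
 * loses the race then dereferences an item that was already handed
 * out (and possibly freed or reused) by the other. */
#include <stddef.h>

typedef struct item { struct item *next; } item_t;
typedef struct { item_t *head; } lifo_t;

static item_t *naive_lifo_pop(lifo_t *lifo)
{
    item_t *item = lifo->head;    /* both threads can read the same item */
    if (NULL == item) {
        return NULL;
    }
    /* window: the other thread may pop, free, or reuse 'item' here */
    lifo->head = item->next;      /* crashes if 'item' is gone by now */
    return item;
}

Given that the r29058 change moved the RML receive callbacks onto a separate 
ORTE progress thread (visible as orte_progress_thread_engine at the bottom of 
the trace), one plausible reading is that the openib OOB connect path now 
touches this free list from that thread without the protection the old path 
relied on, which would also fit the failures being intermittent and more 
frequent at larger np.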

