Re: [OMPI users] After OS Update MPI_Init fails on one host

2013-07-19 Thread Ralph Castain
Are you sure you're using the same version of OMPI on this new OS? They 
typically distribute one in your default path, so I'd check to ensure that you 
really are using the version you think.
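
A quick cross-check is to print the version your mpicc actually compiles
against and compare it with what ompi_info reports at run time. A minimal
sketch, assuming Open MPI's mpi.h (the OMPI_* version macros are Open MPI
specific):

/* ompi_version.c: show the Open MPI version this binary was built against.
 * Assumes Open MPI's mpi.h, which defines OMPI_MAJOR_VERSION et al. */
#include <stdio.h>
#include <mpi.h>

int main(void)
{
    printf("compiled against Open MPI %d.%d.%d\n",
           OMPI_MAJOR_VERSION, OMPI_MINOR_VERSION, OMPI_RELEASE_VERSION);
    return 0;
}

Build it with "mpicc ompi_version.c -o ompi_version", run it directly, and
compare the output with the version printed near the top of plain "ompi_info".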


On Jul 19, 2013, at 12:49 PM, "Kevin H. Hobbs"  wrote:

> I just upgraded the OS on one of my workstations from Fedora 17 to 18
> and now I can't run even the simplest MPI programs.
> 
> I filed a bug report with Fedora's bug tracker :
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=986409
> 
> My simple program is attached as mpi_simple.c
> 
> mpicc works :
> 
>  mpicc -g -o mpi_simple mpi_simple.c
> 
> I can even take the generated program to another computer and it runs fine.
> 
> I can run non-MPI programs with mpirun :
> 
>  mpirun -n 4 hostname
>  murron.hobbs-hancock
>  murron.hobbs-hancock
>  murron.hobbs-hancock
>  murron.hobbs-hancock
> 
> When I run a program that calls MPI_Init I get an error which includes :
> 
> --
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems.  This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
> 
>  orte_util_nidmap_init failed
>  --> Returned value Error (-1) instead of ORTE_SUCCESS
> --
> 
> The output of :
> 
> mpirun -n 1 mpi_simple
> 
> is attached as error.txt
> 
> I suspect it matters that this is a Lenovo S10 with what /proc/cpuinfo
> calls an "Intel(R) Core(TM)2 Quad CPU Q6600"
> 
> I did a bit of poking around in gdb but I don't know what I'm looking for.
> 
> Does anybody have an idea what's going on?




Re: [OMPI users] After OS Update MPI_Init fails on one host

2013-07-19 Thread Kevin H. Hobbs
On 07/19/2013 05:11 PM, Ralph Castain wrote:
> Are you sure you're using the same version of OMPI on this new OS?

No, I'm sure I'm using a different version of Open MPI in Fedora
18 from the one I was using in Fedora 17.

I have only the Open MPI provided by the Fedora distribution.

> They typically distribute one in your default path, 

Fedora allows both Open MPI and MPICH to be installed at the same
time by using the module system.

Neither is in the default path, I have to put them in my path with :

  module load mpi/openmpi-x86_64

which is in my ~/.bashrc .

> so I'd check to ensure that you really are using the version you think.

'locate mpicc' and 'locate mpirun' only find one hit each so I'm
reasonably sure I'm running what I think I'm running.

That being said, packaging bugs have happened before. Is there any
library, config file, or executable that you would suggest I look
for that might have come from a prior version of Open MPI or its
dependencies?






Re: [OMPI users] After OS Update MPI_Init fails on one host

2013-07-19 Thread Jeff Squyres (jsquyres)
Not offhand.  The error you're seeing *typically* indicates that you've got a 
mismatch of OMPI version somewhere.  Are you running on multiple machines with 
different Open MPI versions, perchance?

If you're running only on a single machine, try completely uninstalling the 
Open MPI package, re-installing it, recompiling your trivial app, and see what 
happens.

Also, you might want to check the output of "mpicc yourapp.c --showme" and see 
if it's pointing to the right libraries, etc.


On Jul 19, 2013, at 7:06 PM, "Kevin H. Hobbs"  wrote:

> On 07/19/2013 05:11 PM, Ralph Castain wrote:
>> Are you sure you're using the same version of OMPI on this new OS?
> 
> No, I'm sure I'm using a different version of Open MPI in Fedora
> 18 from the one I was using in Fedora 17.
> 
> I have only the Open MPI provided by the Fedora distribution.
> 
>> They typically distribute one in your default path, 
> 
> Fedora allows both Open MPI and MPICH to be installed at the same
> time by using the module system.
> 
> Neither is in the default path, I have to put them in my path with :
> 
>  module load mpi/openmpi-x86_64
> 
> which is in my ~/.bashrc .
> 
>> so I'd check to ensure that you really are using the version you think.
> 
> 'locate mpicc' and 'locate mpirun' only find one hit each so I'm
> reasonably sure I'm running what I think I'm running.
> 
> That being said, packaging bugs have happened before. Is there any
> library, config file, or executable that you would suggest I look
> for that might have come from a prior version of Open MPI or its
> dependencies?
> 
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] After OS Update MPI_Init fails on one host

2013-07-20 Thread Kevin H. Hobbs
On 07/19/2013 08:27 PM, Jeff Squyres (jsquyres) wrote:
> Not offhand.  The error you're seeing *typically* indicates
> that you've got a mismatch of OMPI version somewhere.

So now the fun part for me is to try to find it, or, failing that, to
eliminate the multiple-versions theory.

> Are you running on multiple machines with different Open MPI
> versions, perchance?

Just one machine right now.


> If you're running only on a single machine, try completely
> uninstalling the Open MPI package, re-installing it,
> recompiling your trivial app, and see what happens.

That's easy enough :

"yum list openmpi*" says I have openmpi.x86_64,
openmpi-debuginfo.x86_64, and openmpi-devel.x86_64 installed.

I did :

  sudo yum remove \
openmpi.x86_64 \
openmpi-debuginfo.x86_64 \
openmpi-devel.x86_64

followed by :

  sudo yum install \
openmpi.x86_64 \
openmpi-debuginfo.x86_64 \
openmpi-devel.x86_64

Then I compiled and ran the program :

  mpicc -g -o mpi_simple mpi_simple.c
  mpirun -n 1 mpi_simple

and got the now familiar error.
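
(mpi_simple.c itself is not reproduced in the thread; judging from the
"my rank is 0 of 1" output reported later, it is presumably something
like this minimal program, shown here only as a hedged reconstruction:)

/* mpi_simple.c -- reconstruction, not the actual attachment */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);               /* the call that fails on this host */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("my rank is %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}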

> Also, you might want to check the output of "mpicc yourapp.c
> --showme" and see if it's pointing to the right libraries, etc.

  mpicc --showme -g -o mpi_simple mpi_simple.c
  gcc -g -o mpi_simple mpi_simple.c \
-I/usr/include/openmpi-x86_64 -pthread -m64 \
-L/usr/lib64/openmpi/lib -lmpi

Is anything hiding there that doesn't belong?

  find /usr/include/openmpi-x86_64/ \
-exec rpm -q --whatprovides {} \; | sort -u

  openmpi-devel-1.6.3-7.fc18.x86_64

  find /usr/lib64/openmpi/lib \
-exec rpm -q --whatprovides {} \; | sort -u

  openmpi-1.6.3-7.fc18.x86_64
  openmpi-devel-1.6.3-7.fc18.x86_64

What is the program actually linked to?

  ldd mpi_simple
linux-vdso.so.1 =>  (0x7fff34151000)
libmpi.so.1 => /usr/lib64/openmpi/lib/libmpi.so.1
(0x7f079fa92000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x003c53e0)
libc.so.6 => /lib64/libc.so.6 (0x003c5320)
libdl.so.2 => /lib64/libdl.so.2 (0x003c53a0)
librt.so.1 => /lib64/librt.so.1 (0x003c5420)
libnsl.so.1 => /lib64/libnsl.so.1 (0x003c6c20)
libutil.so.1 => /lib64/libutil.so.1 (0x003c6de0)
libm.so.6 => /lib64/libm.so.6 (0x003c5360)
libhwloc.so.5 => /lib64/libhwloc.so.5 (0x003c5760)
libltdl.so.7 => /lib64/libltdl.so.7 (0x003c7700)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x003c54a0)
/lib64/ld-linux-x86-64.so.2 (0x003c52e0)
libnuma.so.1 => /lib64/libnuma.so.1 (0x003c5720)
libpci.so.3 => /lib64/libpci.so.3 (0x003c55e0)
libxml2.so.2 => /lib64/libxml2.so.2 (0x003c5d60)
libresolv.so.2 => /lib64/libresolv.so.2 (0x003c55a0)
libz.so.1 => /lib64/libz.so.1 (0x003c5460)
liblzma.so.5 => /lib64/liblzma.so.5 (0x003c5960)

What packages provide them?

  rpm -q --whatprovides \
/usr/lib64/openmpi/lib/libmpi.so.1 \
/lib64/libpthread.so.0 \
/lib64/libc.so.6 \
/lib64/libdl.so.2 \
/lib64/librt.so.1 \
/lib64/libnsl.so.1 \
/lib64/libutil.so.1 \
/lib64/libm.so.6 \
/lib64/libhwloc.so.5 \
/lib64/libltdl.so.7 \
/lib64/libgcc_s.so.1 \
/lib64/libnuma.so.1 \
/lib64/libpci.so.3 \
/lib64/libxml2.so.2 \
/lib64/libresolv.so.2 \
/lib64/libz.so.1 \
/lib64/liblzma.so.5 | sort -u

  glibc-2.16-33.fc18.x86_64
  hwloc-1.4.2-2.fc18.x86_64
  libgcc-4.7.2-8.fc18.x86_64
  libtool-ltdl-2.4.2-7.fc18.x86_64
  libxml2-2.9.1-1.fc18.1.x86_64
  numactl-libs-2.0.7-7.fc18.x86_64
  openmpi-1.6.3-7.fc18.x86_64
  pciutils-libs-3.1.10-2.fc18.x86_64
  xz-libs-5.1.2-2alpha.fc18.x86_64
  zlib-1.2.7-9.fc18.x86_64

I don't see any Fedora 17 stragglers or anything weird.





Re: [OMPI users] After OS Update MPI_Init fails on one host

2013-07-20 Thread Ralph Castain
Afraid I have no earthly idea of the problem - you might try taking it up with 
the package provider. I usually advise people to avoid the packages as you have 
no idea how they were built and thus might find they don't fully support your 
configuration. Not that hard to just download and build the tarball.

Sorry I can't be of more help


On Jul 20, 2013, at 6:09 AM, Kevin H. Hobbs  wrote:

> On 07/19/2013 08:27 PM, Jeff Squyres (jsquyres) wrote:
>> Not offhand.  The error you're seeing *typically* indicates
>> that you've got a mismatch of OMPI version somewhere.
> 
> So now the fun part for me is to try to find it, or, failing that, to
> eliminate the multiple-versions theory.
> 
>> Are you running on multiple machines with different Open MPI
>> versions, perchance?
> 
> Just one machine right now.
> 
> 
>> If you're running only on a single machine, try completely
>> uninstalling the Open MPI package, re-installing it,
>> recompiling your trivial app, and see what happens.
> 
> That's easy enough :
> 
> "yum list openmpi*" says I have openmpi.x86_64,
> openmpi-debuginfo.x86_64, and openmpi-devel.x86_64 installed.
> 
> I did :
> 
>  sudo yum remove \
>openmpi.x86_64 \
>openmpi-debuginfo.x86_64 \
>openmpi-devel.x86_64
> 
> followed by :
> 
>  sudo yum install \
>openmpi.x86_64 \
>openmpi-debuginfo.x86_64 \
>openmpi-devel.x86_64
> 
> Then I compiled and ran the program :
> 
>  mpicc -g -o mpi_simple mpi_simple.c
>  mpirun -n 1 mpi_simple
> 
> and got the now familiar error.
> 
>> Also, you might want to check the output of "mpicc yourapp.c
>> --showme" and see if it's pointing to the right libraries, etc.
> 
>  mpicc --showme -g -o mpi_simple mpi_simple.c
>  gcc -g -o mpi_simple mpi_simple.c \
>-I/usr/include/openmpi-x86_64 -pthread -m64 \
>-L/usr/lib64/openmpi/lib -lmpi
> 
> Is anything hiding there that doesn't belong?
> 
>  find /usr/include/openmpi-x86_64/ \
>-exec rpm -q --whatprovides {} \; | sort -u
> 
>  openmpi-devel-1.6.3-7.fc18.x86_64
> 
>  find /usr/lib64/openmpi/lib \
>-exec rpm -q --whatprovides {} \; | sort -u
> 
>  openmpi-1.6.3-7.fc18.x86_64
>  openmpi-devel-1.6.3-7.fc18.x86_64
> 
> What is the program actually linked to?
> 
>  ldd mpi_simple
>linux-vdso.so.1 =>  (0x7fff34151000)
>libmpi.so.1 => /usr/lib64/openmpi/lib/libmpi.so.1
> (0x7f079fa92000)
>libpthread.so.0 => /lib64/libpthread.so.0 (0x003c53e0)
>libc.so.6 => /lib64/libc.so.6 (0x003c5320)
>libdl.so.2 => /lib64/libdl.so.2 (0x003c53a0)
>librt.so.1 => /lib64/librt.so.1 (0x003c5420)
>libnsl.so.1 => /lib64/libnsl.so.1 (0x003c6c20)
>libutil.so.1 => /lib64/libutil.so.1 (0x003c6de0)
>libm.so.6 => /lib64/libm.so.6 (0x003c5360)
>libhwloc.so.5 => /lib64/libhwloc.so.5 (0x003c5760)
>libltdl.so.7 => /lib64/libltdl.so.7 (0x003c7700)
>libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x003c54a0)
>/lib64/ld-linux-x86-64.so.2 (0x003c52e0)
>libnuma.so.1 => /lib64/libnuma.so.1 (0x003c5720)
>libpci.so.3 => /lib64/libpci.so.3 (0x003c55e0)
>libxml2.so.2 => /lib64/libxml2.so.2 (0x003c5d60)
>libresolv.so.2 => /lib64/libresolv.so.2 (0x003c55a0)
>libz.so.1 => /lib64/libz.so.1 (0x003c5460)
>liblzma.so.5 => /lib64/liblzma.so.5 (0x003c5960)
> 
> What packages provide them?
> 
>  rpm -q --whatprovides \
>/usr/lib64/openmpi/lib/libmpi.so.1 \
>/lib64/libpthread.so.0 \
>/lib64/libc.so.6 \
>/lib64/libdl.so.2 \
>/lib64/librt.so.1 \
>/lib64/libnsl.so.1 \
>/lib64/libutil.so.1 \
>/lib64/libm.so.6 \
>/lib64/libhwloc.so.5 \
>/lib64/libltdl.so.7 \
>/lib64/libgcc_s.so.1 \
>/lib64/libnuma.so.1 \
>/lib64/libpci.so.3 \
>/lib64/libxml2.so.2 \
>/lib64/libresolv.so.2 \
>/lib64/libz.so.1 \
>/lib64/liblzma.so.5 | sort -u
> 
>  glibc-2.16-33.fc18.x86_64
>  hwloc-1.4.2-2.fc18.x86_64
>  libgcc-4.7.2-8.fc18.x86_64
>  libtool-ltdl-2.4.2-7.fc18.x86_64
>  libxml2-2.9.1-1.fc18.1.x86_64
>  numactl-libs-2.0.7-7.fc18.x86_64
>  openmpi-1.6.3-7.fc18.x86_64
>  pciutils-libs-3.1.10-2.fc18.x86_64
>  xz-libs-5.1.2-2alpha.fc18.x86_64
>  zlib-1.2.7-9.fc18.x86_64
> 
> I don't see any Fedora 17 stragglers or anything weird.
> 




Re: [OMPI users] After OS Update MPI_Init fails on one host

2013-07-20 Thread Kevin H. Hobbs
On 07/20/2013 10:28 AM, Ralph Castain wrote:
> Afraid I have no earthly idea of the problem - you might try
> taking it up with the package provider.

This is a link to the bug report filed in the Fedora bugzilla

https://bugzilla.redhat.com/show_bug.cgi?id=986409

The advice I got there was to come here.

> I usually advise people to avoid the packages as you have no
> idea how they were built and thus might find they don't fully
> support your configuration.

I took a peek at the src.rpm; it wasn't one of those hundred-patch
monsters.

> Not that hard to just download and build the tarball.

It usually isn't; I'm just a little reluctant to build yet
another package that's provided by my distribution.

> 
> Sorry I can't be of more help

No problem.

I'm off to build openmpi from source!









Re: [OMPI users] After OS Update MPI_Init fails on one host

2013-07-20 Thread Kevin H. Hobbs
On 07/20/2013 10:28 AM, Ralph Castain wrote:
> avoid the packages as you have no idea how they were built

So I built openmpi-1.6.5 from the tarball and, of course, everything
works fine; that is, my simple program got through MPI_Init and printed
its rank.

I configured it very simply :

  ./configure --prefix=/opt/openmpi-1.6.5

and I noticed that the generated program does not link hwloc like
the program did when I used the system mpicc.

I installed the openmpi source rpm and took a look at the configure
section; it uses --with-hwloc=/usr, so I reconfigured the openmpi-1.6.5
source with :

  ./configure \
--prefix=/opt/openmpi-1.6.5 \
--with-hwloc=/usr

and when I rebuilt my program with the generated openmpi my error
returns.





Re: [OMPI users] After OS Update MPI_Init fails on one host

2013-07-20 Thread Ralph Castain
Ah! That would indicate an issue with the external hwloc package they provided, 
which is the big reason we don't recommend installing from packages. We have 
internal copies of hwloc and libevent that ensure (a) they are at the proper 
level, and (b) they are configured properly for OMPI's use.

We've had that debate with the Fedora folks recently - they won't take our 
advice, so I'm afraid you'll just need to build from source to have something 
usable.

On Jul 20, 2013, at 1:07 PM, "Kevin H. Hobbs"  wrote:

> On 07/20/2013 10:28 AM, Ralph Castain wrote:
>> avoid the packages as you have no idea how they were built
> 
> So I built openmpi-1.6.5 from the tarball and, of course, everything
> works fine; that is, my simple program got through MPI_Init and printed
> its rank.
> 
> I configured it very simply :
> 
>  ./configure --prefix=/opt/openmpi-1.6.5
> 
> and I noticed that the generated program does not link hwloc like
> the program did when I used the system mpicc.
> 
> I installed the openmpi source rpm and took a look at the configure
> section; it uses --with-hwloc=/usr, so I reconfigured the openmpi-1.6.5
> source with :
> 
>  ./configure \
>--prefix=/opt/openmpi-1.6.5 \
>--with-hwloc=/usr
> 
> and when I rebuilt my program with the generated openmpi my error
> returns.
> 




Re: [OMPI users] After OS Update MPI_Init fails on one host

2013-07-21 Thread Kevin H. Hobbs
On 07/20/2013 04:14 PM, Ralph Castain wrote:
> Ah! That would indicate an issue with the external hwloc
> package they provided, which is the big reason we don't
> recommend installing from packages.

I'll happily report the bug to the hwloc developers.

I'll also add what we've found here to the bug on the Fedora
bugzilla.

Is there anything more I can do on this list to figure out the
nature of the bug?

> We have internal copies of hwloc and libevent that ensure (a)
> they are at the proper level, and (b) they are configured
> properly for OMPI's use.

It does look like Fedora's hwloc is ahead of OMPI's.

Fedora 18 has openmpi-1.6.3 and hwloc-1.4.2.

The source of openmpi-1.6.5 has hwloc-1.3.2.

How can I tell what the configuration differences are?

The entire configure section of the .spec file in
hwloc-1.4.2-2.fc18.src.rpm is :

  %configure
  %{__make} %{?_smp_mflags} V=1

I don't see anything that looks like any hwloc configure options
are being set.

How do I tell how OMPI configures its bundled hwloc?

> We've had that debate with the Fedora folks recently - they
> won't take our advice,

I admit to having a bias against bundled dependencies, but I
don't want to get in the middle of a debate.

> so I'm afraid you'll just need to build from source to have
> something usable.

Yes, that solves my problem, but if I can I'd like to leave at
least a trail of breadcrumbs that the next person to encounter
the bug can follow.

Better yet, I'd like to figure out the actual nature of the bug
and report it in the proper place.





Re: [OMPI users] After OS Update MPI_Init fails on one host

2013-07-23 Thread Jeff Squyres (jsquyres)
On Jul 21, 2013, at 8:50 AM, Kevin H. Hobbs  wrote:

>> Ah! That would indicate an issue with the external hwloc
>> package they provided, which is the big reason we don't
>> recommend installing from packages.
> 
> I'll happily report the bug to the hwloc developers.

I don't think that this is necessarily an hwloc bug.

> I'll also add what we've found here to the bug on the Fedora
> bugzilla.
> 
> Is there anything more I can do on this list to figure out the
> nature of the bug?
> 
>> We have internal copies of hwloc and libevent that ensure (a)
>> they are at the proper level, and (b) they are configured
>> properly for OMPI's use.
> 
> It does look like Fedora's hwloc is ahead of OMPI's.
> 
> Fedora 18 has openmpi-1.6.3 and hwloc-1.4.2.
> 
> The source of openmpi-1.6.5 has hwloc-1.3.2.

Hypothetically, hwloc 1.4.x is backwards source-compatible with hwloc 1.3.x, 
but we have not tested this.  I don't know if hwloc has, either (I'm sure they 
haven't tested with Open MPI 1.6.x).

> How can I tell what the configuration differences are?
> 
> The entire configure section of the .spec file in
> hwloc-1.4.2-2.fc18.src.rpm is :
> 
>  %configure
>  %{__make} %{?_smp_mflags} V=1

OMPI builds hwloc in "embedded" mode, which means that OMPI's configure line is 
used to build hwloc (vs. having a separate configure invocation for hwloc).  
They're hypothetically the moral equivalent of each other, but perhaps 
something is different somehow...

> I don't see anything that looks like any hwloc configure options
> are being set.
> 
> How do I tell how OMPI configures its bundled hwloc?

With this embedded mechanism, we're calling hwloc's configury with the moral 
equivalent of:

./configure --disable-cairo --disable-libxml2 --enable-xml 
--with-hwloc-symbol-prefix=opal_hwloc152_ --enable-embedded-mode

> Better yet, I'd like to figure out the actual nature of the bug
> and report it in the proper place.


Yes, it's curious that they can't reproduce your issue, which suggests that the 
hwloc issue is a red herring (because, as stated above, hwloc *should* be 
backwards compatible).

Ralph: is there an easy way to find out more detail on why 
orte_util_nidmap_init() failed without attaching a debugger?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] After OS Update MPI_Init fails on one host

2013-07-23 Thread Ralph Castain

On Jul 23, 2013, at 3:56 AM, Jeff Squyres (jsquyres)  wrote:

> On Jul 21, 2013, at 8:50 AM, Kevin H. Hobbs  wrote:
> 
>>> Ah! That would indicate an issue with the external hwloc
>>> package they provided, which is the big reason we don't
>>> recommend installing from packages.
>> 
>> I'll happily report the bug to the hwloc developers.
> 
> I don't think that this is necessarily an hwloc bug.
> 
>> I'll also add what we've found here to the bug on the Fedora
>> bugzilla.
>> 
>> Is there anything more I can do on this list to figure out the
>> nature of the bug?
>> 
>>> We have internal copies of hwloc and libevent that ensure (a)
>>> they are at the proper level, and (b) they are configured
>>> properly for OMPI's use.
>> 
>> It does look like Fedora's hwloc is ahead of OMPI's.
>> 
>> Fedora 18 has openmpi-1.6.3 and hwloc-1.4.2.
>> 
>> The source of openmpi-1.6.5 has hwloc-1.3.2.
> 
> Hypothetically, hwloc 1.4.x is backwards source-compatible with hwloc 1.3.x, 
> but we have not tested this.  I don't know if hwloc has, either (I'm sure 
> they haven't tested with Open MPI 1.6.x).
> 
>> How can I tell what the configuration differences are?
>> 
>> The entire configure section of the .spec file in
>> hwloc-1.4.2-2.fc18.src.rpm is :
>> 
>> %configure
>> %{__make} %{?_smp_mflags} V=1
> 
> OMPI builds hwloc in "embedded" mode, which means that OMPI's configure line 
> is used to build hwloc (vs. having a separate configure invocation for 
> hwloc).  They're hypothetically the moral equivalent of each other, but 
> perhaps something is different somehow...
> 
>> I don't see anything that looks like any hwloc configure options
>> are being set.
>> 
>> How do I tell how OMPI configures its bundled hwloc?
> 
> With this embedded mechanism, we're calling hwloc's configury with the moral 
> equivalent of:
> 
> ./configure --disable-cairo --disable-libxml2 --enable-xml 
> --with-hwloc-symbol-prefix=opal_hwloc152_ --enable-embedded-mode
> 
>> Better yet, I'd like to figure out the actual nature of the bug
>> and report it in the proper place.
> 
> 
> Yes, it's curious that they can't reproduce your issue,

Guess I missed this - where does it say that they can't reproduce the issue?? 
I'm suspicious because build-from-source produced a working result.

> which suggests that the hwloc issue is a red herring (because, as stated 
> above, hwloc *should* be backwards compatible).
> 
> Ralph: is there an easy way to find out more detail on why 
> orte_util_nidmap_init() failed without attaching a debugger?

A debugger would be the best way.

> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 




Re: [OMPI users] After OS Update MPI_Init fails on one host

2013-07-23 Thread Jeff Squyres (jsquyres)
On Jul 23, 2013, at 8:54 AM, Ralph Castain  wrote:

>> Yes, it's curious that they can't reproduce your issue,
> 
> Guess I missed this - where does it say that they can't reproduce the issue?? 
> I'm suspicious because build-from-source produced a working result.

Orion mentioned it in https://bugzilla.redhat.com/show_bug.cgi?id=986409.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] After OS Update MPI_Init fails on one host

2013-07-23 Thread Ralph Castain
I see - I didn't look at the redhat bug list. Sadly, I have no idea how to 
debug it. The Fedora package is built optimized, so no OMPI debugging output is 
available and a debugger won't tell us a lot.

Best guess is that there is something in the build that doesn't match the 
user's system. The nidmap_init routine unpacks a buffer that contains a bunch 
of process mapping info that mpirun packed into it - don't usually see an error 
in there.


On Jul 23, 2013, at 5:57 AM, "Jeff Squyres (jsquyres)"  
wrote:

> On Jul 23, 2013, at 8:54 AM, Ralph Castain  wrote:
> 
>>> Yes, it's curious that they can't reproduce your issue,
>> 
>> Guess I missed this - where does it say that they can't reproduce the 
>> issue?? I'm suspicious because build-from-source produced a working result.
> 
> Orion mentioned it in https://bugzilla.redhat.com/show_bug.cgi?id=986409.
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 




Re: [OMPI users] After OS Update MPI_Init fails on one host

2013-07-23 Thread Kevin H. Hobbs
On 07/23/2013 09:36 AM, Ralph Castain wrote:
> The Fedora package is built optimized, so no OMPI debugging output is
> available and a debugger won't tell us a lot.

The Fedora package comes with a debuginfo package that has everything
gdb needs to let me step through the Open MPI functions.

I also have the source .rpm installed; I can just change the .spec and
rebuild any way you want.

The trouble is there is a lot of unpack this... unpack that... before
the debugger gets to the error.

Ahhh! I forgot about 'step count'. I can narrow down on this MUCH more
closely in a few hours.






Re: [OMPI users] After OS Update MPI_Init fails on one host

2013-07-23 Thread Jeff Squyres (jsquyres)
Kevin --

I don't know if Fedora RPMs include -g in their builds, or if Fedora includes a 
debuginfo RPM that you could install such that you can attach a debugger and be 
able to dig into OMPI's internals yourself.

If that doesn't work, you might need to build from source yourself, link 
against the external hwloc (you said you could replicate the error this way), 
and compile with -g (e.g., "./configure CFLAGS=-g LDFLAGS=-g ...").  This would 
allow you to gdb attach and see what's going on.  

Alternatively, you could add some opal_output(0, "printf like args here"); 
statements in the orte_util_nidmap_init() function to see where it's failing 
(look in orte/util/nidmap.c).



On Jul 23, 2013, at 9:36 AM, Ralph Castain  wrote:

> I see - I didn't look at the redhat bug list. Sadly, I have no idea how to 
> debug it. The Fedora package is built optimized, so no OMPI debugging output 
> is available and a debugger won't tell us a lot.
> 
> Best guess is that there is something in the build that doesn't match the 
> user's system. The nidmap_init routine unpacks a buffer that contains a bunch 
> of process mapping info that mpirun packed into it - don't usually see an 
> error in there.
> 
> 
> On Jul 23, 2013, at 5:57 AM, "Jeff Squyres (jsquyres)"  
> wrote:
> 
>> On Jul 23, 2013, at 8:54 AM, Ralph Castain  wrote:
>> 
 Yes, it's curious that they can't reproduce your issue,
>>> 
>>> Guess I missed this - where does it say that they can't reproduce the 
>>> issue?? I'm suspicious because build-from-source produced a working result.
>> 
>> Orion mentioned it in https://bugzilla.redhat.com/show_bug.cgi?id=986409.
>> 
>> -- 
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to: 
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>> 
>> 
> 
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] After OS Update MPI_Init fails on one host

2013-07-23 Thread Kevin H. Hobbs
On 07/23/2013 06:56 AM, Jeff Squyres (jsquyres) wrote:
> With this embedded mechanism, we're calling hwloc's configury with
> the moral equivalent of:
> 
> ./configure --disable-cairo --disable-libxml2 --enable-xml
> --with-hwloc-symbol-prefix=opal_hwloc152_ --enable-embedded-mode

I configured hwloc-1.4.3 with :

./configure \
  --prefix=/opt/hwloc-1.4.3 \
  --disable-cairo \
  --disable-libxml2

I left off --with-hwloc-symbol-prefix=opal_hwloc152_ because there
seems to be no way to tell openmpi-1.6.5 about this name mangling.

I left off --enable-embedded-mode because with this option nothing is
installed.

I left off --enable-xml because configure warns :

configure: WARNING: unrecognized options: --enable-xml

I configured openmpi-1.6.5 with :

./configure \
  --prefix=/opt/openmpi-1.6.5_hwloc-1.4.3 \
  --with-hwloc=/opt/hwloc-1.4.3

I built my simple program with :

/opt/openmpi-1.6.5_hwloc-1.4.3/bin/mpicc -g \
  -o mpi_simple mpi_simple.c

and I ran it with :

/opt/openmpi-1.6.5_hwloc-1.4.3/bin/mpirun -n 1 mpi_simple

and got as output "my rank is 0 of 1".

So this is the _only_ openmpi build configured with the "--with-hwloc="
option that has worked.





Re: [OMPI users] After OS Update MPI_Init fails on one host

2013-07-23 Thread Kevin H. Hobbs
On 07/23/2013 09:54 AM, Jeff Squyres (jsquyres) wrote:
> 
> I don't know if Fedora RPMs include -g in their builds, or if Fedora
> includes a debuginfo RPM that you could install such that you can attach
> a debugger and be able to dig into OMPI's internals yourself.
> 

There is a debuginfo package.

Since I removed all of fedora's openmpi packages and installed from
source into /opt/openmpi-1.6.5 and /opt/openmpi-1.6.5_hwloc-1.4.3 to
narrow down on this problem, I now have to re-install the rpms with yum.

sudo yum install openmpi openmpi-devel openmpi-debuginfo

These don't put anything into my PATH or LD_LIBRARY_PATH so I have to :

module load mpi/openmpi-x86_64

I compiled my simple program with :

mpicc -g -o mpi_simple mpi_simple.c

The program links to fedora's copies of the libraries of interest :

mpirun -n 1 ldd mpi_simple | grep hwloc
  libhwloc.so.5 => /lib64/libhwloc.so.5 (0x003c5760)
mpirun -n 1 ldd mpi_simple | grep mpi
  libmpi.so.1 => /usr/lib64/openmpi/lib/libmpi.so.1 (0x7f7207e29000)

I started the debugger with :

mpirun -n 1 gdb mpi_simple

When run in the debugger I got the error I described.

I reran and in gdb did :

set breakpoint pending on
break util/nidmap.c:146
run
step

took me into 'opal_dss_unpack'. Then I did 'next' until I got past
'opal_dss_unpack_buffer', which returned the -1 we see outside.







Re: [OMPI users] After OS Update MPI_Init fails on one host

2013-07-23 Thread Ralph Castain
Yeah, it's failing when trying to unpack the topology obtained from hwloc. My 
guess is that one of the following calls changed in hwloc-1.4.3:

    if (0 != hwloc_topology_set_xmlbuffer(t, xmlbuffer, strlen(xmlbuffer))) {
        rc = OPAL_ERROR;
        free(xmlbuffer);
        hwloc_topology_destroy(t);
        goto cleanup;
    }
    /* since we are loading this from an external source, we have to
     * explicitly set a flag so hwloc sets things up correctly
     */
    if (0 != hwloc_topology_set_flags(t, HWLOC_TOPOLOGY_FLAG_IS_THISSYSTEM)) {
        free(xmlbuffer);
        rc = OPAL_ERROR;
        goto cleanup;
    }

The only other things in that routine are hwloc_topology_init and
hwloc_topology_load, and those haven't changed in a while.
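
A standalone way to exercise that same encode/decode path, with OMPI out of
the picture, is a small XML roundtrip against the hwloc 1.x API. This is only
a sketch (it assumes hwloc 1.3 or later for the xmlbuffer export/free calls),
but it mirrors the set_xmlbuffer/set_flags/load sequence above:

/* roundtrip.c: export the local topology to an XML buffer and re-import it,
 * roughly what mpirun and the MPI process do between them.
 * Sketch against the hwloc 1.x API; build with: gcc -o roundtrip roundtrip.c -lhwloc */
#include <stdio.h>
#include <string.h>
#include <hwloc.h>

int main(void)
{
    hwloc_topology_t src, dst;
    char *xmlbuffer = NULL;
    int buflen = 0;

    /* discover the local topology and export it to XML */
    hwloc_topology_init(&src);
    hwloc_topology_load(src);
    hwloc_topology_export_xmlbuffer(src, &xmlbuffer, &buflen);
    if (NULL == xmlbuffer) {
        fprintf(stderr, "XML export failed\n");
        return 1;
    }

    /* re-import it the way the OMPI code above does (note: strlen, not buflen) */
    hwloc_topology_init(&dst);
    if (0 != hwloc_topology_set_xmlbuffer(dst, xmlbuffer, strlen(xmlbuffer))) {
        fprintf(stderr, "hwloc_topology_set_xmlbuffer failed\n");
        return 1;
    }
    if (0 != hwloc_topology_set_flags(dst, HWLOC_TOPOLOGY_FLAG_IS_THISSYSTEM)) {
        fprintf(stderr, "hwloc_topology_set_flags failed\n");
        return 1;
    }
    if (0 != hwloc_topology_load(dst)) {
        fprintf(stderr, "load of the XML topology failed\n");
        return 1;
    }

    printf("roundtrip OK, %d PUs\n", hwloc_get_nbobjs_by_type(dst, HWLOC_OBJ_PU));

    hwloc_free_xmlbuffer(src, xmlbuffer);
    hwloc_topology_destroy(dst);
    hwloc_topology_destroy(src);
    return 0;
}

Running that against the Fedora (libxml2-enabled) hwloc versus a
--disable-libxml2 build would show whether the library alone reproduces the
failure.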


On Jul 23, 2013, at 11:12 AM, Kevin H. Hobbs  wrote:

> On 07/23/2013 09:54 AM, Jeff Squyres (jsquyres) wrote:
>> 
>> I don't know if Fedora RPMs include -g in their builds, or if Fedora
>> includes a debuginfo RPM that you could install such that you can attach
>> a debugger and be able to dig into OMPI's internals yourself.
>> 
> 
> There is a debuginfo package.
> 
> Since I removed all of fedora's openmpi packages and installed from
> source into /opt/openmpi-1.6.5 and /opt/openmpi-1.6.5_hwloc-1.4.3 to
> narrow down on this problem, I now have to re-install the rpms with yum.
> 
> sudo yum install openmpi openmpi-devel openmpi-debuginfo
> 
> These don't put anything into my PATH or LD_LIBRARY_PATH so I have to :
> 
> module load mpi/openmpi-x86_64
> 
> I compiled my simple program with :
> 
> mpicc -g -o mpi_simple mpi_simple.c
> 
> The program links to fedora's copies of the libraries of interest :
> 
> mpirun -n 1 ldd mpi_simple | grep hwloc
>  libhwloc.so.5 => /lib64/libhwloc.so.5 (0x003c5760)
> mpirun -n 1 ldd mpi_simple | grep mpi
>  libmpi.so.1 => /usr/lib64/openmpi/lib/libmpi.so.1 (0x7f7207e29000)
> 
> I started the debugger with :
> 
> mpirun -n 1 gdb mpi_simple
> 
> When run in the debugger I got the error I described.
> 
> I reran and in gdb did :
> 
> set breakpoint pending on
> break util/nidmap.c:146
> run
> step
> 
> took me into 'opal_dss_unpack'. Then I did 'next' until I got past
> 'opal_dss_unpack_buffer', which returned the -1 we see outside.
> 
> 
> 




Re: [OMPI users] After OS Update MPI_Init fails on one host

2013-07-23 Thread Kevin H. Hobbs
On 07/23/2013 02:22 PM, Ralph Castain wrote:
> Yeah, it's failing when trying to unpack the topology obtained from hwloc.

What I find very interesting is that the hwloc configure options
--disable-cairo --disable-libxml2 turn the bug off.

I'll keep walking through the execution in gdb; maybe I'll be able to
narrow it down some more.








Re: [OMPI users] After OS Update MPI_Init fails on one host

2013-07-23 Thread Ralph Castain
That's understandable - if you don't disable xml2, then hwloc uses the xml2 
library to do the topology encoding. We rely on their internal "quasi-xml" 
encoding method, which I believe provides some different data (and definitely 
different format). I suspect this is causing the confusion, though it is 
strange since we have them encode and decode it - you would think the routines 
come from the same place and would be compatible.

Could be something hwloc needs to look at.

On Jul 23, 2013, at 11:45 AM, "Kevin H. Hobbs"  wrote:

> On 07/23/2013 02:22 PM, Ralph Castain wrote:
>> Yeah, it's failing when trying to unpack the topology obtained from hwloc.
> 
> What I find very interesting is that the hwloc configure options
> --disable-cairo --disable-libxml2 turn the bug off.
> 
> I'll keep walking through the execution in gdb; maybe I'll be able to
> narrow it down some more.
> 
> 
> 
> 




Re: [OMPI users] After OS Update MPI_Init fails on one host

2013-07-23 Thread Kevin H. Hobbs
On 07/23/2013 02:22 PM, Ralph Castain wrote:
> Yeah, it's failing when trying to unpack the topology obtained from
> hwloc. My guess is that one of the following calls changed in
> hwloc-1.4.3:
> 

It appears to be this one.

hwloc_topology_set_xmlbuffer

I'll return what I've gathered so far to the Fedora bug, and maybe look
at what goes on in hwloc.







Re: [OMPI users] After OS Update MPI_Init fails on one host

2013-07-26 Thread Dave Love
"Kevin H. Hobbs"  writes:

> The program links to fedora's copies of the libraries of interest :
>
> mpirun -n 1 ldd mpi_simple | grep hwloc
>   libhwloc.so.5 => /lib64/libhwloc.so.5 (0x003c5760)

[I'm surprised it's in /lib64.]

> mpirun -n 1 ldd mpi_simple | grep mpi
>   libmpi.so.1 => /usr/lib64/openmpi/lib/libmpi.so.1 (0x7f7207e29000)

I'm on RH6, not Fedora, but I'm successfully using a 1.6.5 package built
with a modified (for extra config options) Fedora spec file.  It
dynamically links against an hwloc 1.7 package (not the original system
hwloc version), but I don't know if that makes a difference.

I agree about the value of OS packages and libraries -- money where
mouth is: I maintain packaging for SGE against hwloc etc.

-- 
Community Grid Engine:  http://arc.liv.ac.uk/SGE/