[OMPI devel] running Open MPI with different install paths

2015-04-17 Thread Gilles Gouaillardet

Folks,

I am trying to run heterogeneous Open MPI.
All my nodes share everything over NFS, so I need to manually specify that
x86_64 nodes must use /.../ompi-x86_64 and sparcv9 nodes must use
/.../ompi-sparcv9.


Is there a simple way to achieve this?

Cheers,

Gilles


Re: [OMPI devel] running Open MPI with different install paths

2015-04-17 Thread Ralph Castain
Hi Gilles

What launch environment? We don't currently have a simple way of doing this
outside of ensuring the paths on those nodes point to the correct default
place (i.e., you can't use prefix). However, it might be possible to add
such support if we knew which nodes were what type. Unfortunately, we would
need to know that prior to launching the daemons, so we can't self-discover
it.


On Fri, Apr 17, 2015 at 2:32 AM, Gilles Gouaillardet 
wrote:

> Folks,
>
> i am trying to run heterogeneous Open MPI.
> all my nodes use NFS everything is shared, so i need to manually specify
> x86_64 nodes must use /.../ompi-x86_64 and sparcv9 nodes must use
> /.../ompi-sparcv9
>
> is there a simple way to achieve this ?
>
> Cheers,
>
> Gilles
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/04/17237.php
>


Re: [OMPI devel] running Open MPI with different install paths

2015-04-17 Thread Jeff Squyres (jsquyres)
Back in the days when I worked on heterogeneous machines like this, I had logic 
in my shell startup files to set paths properly.  E.g. (pseudocode):

-
arch=`config.guess`
case $arch in
  *x86_64-linux*)
    prefix_path=$HOME/x86_64-linux-stuff/bin
    prefix_ldpath=$HOME/x86_64-linux-stuff/lib
    prefix_manpath=$HOME/x86_64-linux-stuff/share/man
    ;;
  *sparc*)
    prefix_path=$HOME/sparc/bin
    prefix_ldpath=$HOME/sparc/lib
    prefix_manpath=$HOME/sparc/share/man
    ;;
  # ...etc.
esac

export PATH=$prefix_path:$PATH
export LD_LIBRARY_PATH=$prefix_ldpath:$LD_LIBRARY_PATH
export MANPATH=$prefix_manpath:$MANPATH
-
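
Applied to the install layout from the original question, the same idea might look
like this (a sketch only; $OMPI_ROOT is a placeholder for the elided shared NFS
path, and `uname -m` is used instead of config.guess for brevity):

-
# $OMPI_ROOT: placeholder for the shared NFS root holding the per-arch installs
# pick the Open MPI install that matches this node's architecture
case `uname -m` in
  x86_64) ompi_prefix=$OMPI_ROOT/ompi-x86_64  ;;
  sparc*) ompi_prefix=$OMPI_ROOT/ompi-sparcv9 ;;
esac

export PATH=$ompi_prefix/bin:$PATH
export LD_LIBRARY_PATH=$ompi_prefix/lib:$LD_LIBRARY_PATH
export MANPATH=$ompi_prefix/share/man:$MANPATH
-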


> On Apr 17, 2015, at 5:34 AM, Ralph Castain  wrote:
> 
> Hi Gilles
> 
> What launch environment? We don't currently have a simple way of doing this 
> outside of ensuring the paths on those nodes point to the correct default 
> place (i.e., you can't use prefix). However, it might be possible to add such 
> support if we knew which nodes were what type. Unfortunately, we would need 
> to know that prior to launching the daemons, so we can't self-discover it.
> 
> 
> On Fri, Apr 17, 2015 at 2:32 AM, Gilles Gouaillardet  
> wrote:
> Folks,
> 
> i am trying to run heterogeneous Open MPI.
> all my nodes use NFS everything is shared, so i need to manually specify
> x86_64 nodes must use /.../ompi-x86_64 and sparcv9 nodes must use 
> /.../ompi-sparcv9
> 
> is there a simple way to achieve this ?
> 
> Cheers,
> 
> Gilles
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2015/04/17237.php
> 
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2015/04/17238.php


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



[OMPI devel] VERSION numbers for v1.8.5

2015-04-17 Thread Jeff Squyres (jsquyres)
I reviewed the v1.8 logs and I think that this is what the shared library
version numbers should be.  Essentially: most have minor code changes, meaning
that they should get a revision bump (the "r" in c:r:a).  The rest should also
get a revision bump because mutex.h changed.  This latter part might be a little
overkill, but I think it's harmless (and I don't feel like checking closely to
see if mutex.h directly affects each shared library that didn't have other code
changes).

Can someone please double check these?

-
# JMS Multiple changes in OMPI code: was 7:0:6
libmpi_so_version=7:1:6
libmpi_cxx_so_version=2:3:1
# JMS Fortran code changed: was 7:0:5
libmpi_mpifh_so_version=7:1:5
# JMS Fortran code changed: was 5:0:4
libmpi_usempi_tkr_so_version=5:1:4
# JMS Fortran code changed: was 1:0:1
libmpi_usempi_ignore_tkr_so_version=1:1:1
# JMS Fortran code changed: was 6:0:6
libmpi_usempif08_so_version=6:1:6
# JMS ORTE code changed: was 7:5:0
libopen_rte_so_version=7:6:0
# JMS Multiple changes in OPAL code: was 8:1:2
libopen_pal_so_version=8:2:2
libmpi_java_so_version=3:0:2
# JMS Oshmem code changed: was 3:1:0
liboshmem_so_version=3:2:0

# OMPI layer
# JMS cuda code changed: was 1:7:0
libmca_common_cuda_so_version=1:8:0
# JMS Changed opal/thread/mutex.h: was 2:5:0
libmca_common_mx_so_version=2:6:0
# JMS Changed opal/thread/mutex.h: was 4:4:0
libmca_common_sm_so_version=4:5:0
# JMS Changed opal/thread/mutex.h: was 2:1:2
libmca_common_ugni_so_version=2:2:2
# JMS Changed opal/thread/mutex.h: was 2:3:2
libmca_common_verbs_so_version=2:4:2

# OPAL layer
# JMS Changed opal/thread/mutex.h: was 2:3:1
libmca_opal_common_pmi_so_version=2:4:1
-
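
For anyone double checking: these triples follow the standard libtool
-version-info convention, so the code-only changes above bump only the middle
number.  A quick sketch of the rules as they apply to the entries above
(illustrative only, not the actual build logic):

-
# libtool -version-info is current:revision:age (c:r:a)
#   source changed, interfaces unchanged     -> c : r+1 : a
#   interfaces added (backwards compatible)  -> c+1 : 0 : a+1
#   interfaces removed or changed            -> c+1 : 0 : 0
# e.g. the first entry above, a code-only change:
libmpi_so_version=7:1:6   # was 7:0:6
-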

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



[OMPI devel] v1.8.5 NEWS and README

2015-04-17 Thread Jeff Squyres (jsquyres)
The v1.8 branch NEWS, README, and VERSION files have been updated in 
preparation for the v1.8.5 release.  Please double check them -- especially 
NEWS, particularly to ensure that we are giving credit to users who submitted 
bug reports, etc.

Also, please double check that this is a current/correct list of supported 
systems:

- The run-time systems that are currently supported are:
  - rsh / ssh
  - LoadLeveler
  - PBS Pro, Torque
  - Platform LSF (v7.0.2 and later)
  - SLURM
  - Cray XT-3, XT-4, and XT-5
  - Oracle Grid Engine (OGE) 6.1, 6.2 and open source Grid Engine

- Systems that have been tested are:
  - Linux (various flavors/distros), 32 bit, with gcc
  - Linux (various flavors/distros), 64 bit (x86), with gcc, Absoft,
Intel, and Portland (*)
  - OS X (10.6, 10.7, 10.8, 10.9), 32 and 64 bit (x86_64), with gcc and
Absoft compilers (*)

  (*) Be sure to read the Compiler Notes, below.

- Other systems have been lightly (but not fully) tested:
  - Cygwin 32 & 64 bit with gcc
  - ARMv4, ARMv5, ARMv6, ARMv7 (when using non-inline assembly; only
ARMv7 is fully supported when -DOMPI_DISABLE_INLINE_ASM is used).
  - Other 64 bit platforms (e.g., Linux on PPC64)
  - Oracle Solaris 10 and 11, 32 and 64 bit (SPARC, i386, x86_64),
with Oracle Solaris Studio 12.2 and 12.3

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] running Open MPI with different install paths

2015-04-17 Thread Gilles Gouaillardet
Ralph,

right now, I am using ssh.

One way to go is to extend the machine file syntax:
instead of
user@host
we could have
user@host:port//prefix

Another way would be to do this on the command line:
mpirun --host host1 --prefix prefix1 a.out : --host host2 --prefix prefix2
b.out

Another really nice feature would be to have a suffix per arch; Intel MPI
is doing this for MIC:
mpirun --host cpu,mic a.out
a.out runs on the Xeon and a.out.mic runs on the Xeon Phi.

For the time being, I use a very cheap hack:
replace orted with a shell script that execs the right binary,
and replace a.out with a wrapper that
1. sets LD_LIBRARY_PATH according to the right arch,
   since Open MPI sets this for the arch on which mpirun is invoked, which
   might not be the expected one
2. invokes the right binary for this arch (a sketch follows below)
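
A minimal sketch of such a wrapper (assuming the per-arch installs live under a
shared $PREFIX_ROOT and are named after whatever `uname -m` reports on each node;
both names are placeholders):

-
#!/bin/sh
# $PREFIX_ROOT: placeholder for the shared root holding one install per arch
# select the Open MPI install that matches this node's architecture,
# then exec the real binary so it picks up the matching libraries
arch=`uname -m`
prefix=$PREFIX_ROOT/ompi-$arch
export PATH=$prefix/bin:$PATH
export LD_LIBRARY_PATH=$prefix/lib:$LD_LIBRARY_PATH
exec "$@"
-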

Another really nice feature would be to have mpirun invoke this wrapper
"under the hood":

mpirun --wrapper wrapper.sh a.out
would invoke
wrapper.sh a.out
on all the nodes, and in this script, I can manually exec a.out from the
right path (e.g. one path per arch).

An alternative to extending the machine file syntax would be to invoke
another wrapper for orted:
mpirun --orted-wrapper ortedwrapper.sh --wrapper binwrapper.sh a.out
That would allow the end user to "implement" self discovery.

cheers,

Gilles

On Friday, April 17, 2015, Ralph Castain  wrote:

> Hi Gilles
>
> What launch environment? We don't currently have a simple way of doing
> this outside of ensuring the paths on those nodes point to the correct
> default place (i.e., you can't use prefix). However, it might be possible
> to add such support if we knew which nodes were what type. Unfortunately,
> we would need to know that prior to launching the daemons, so we can't
> self-discover it.
>
>
> On Fri, Apr 17, 2015 at 2:32 AM, Gilles Gouaillardet  > wrote:
>
>> Folks,
>>
>> i am trying to run heterogeneous Open MPI.
>> all my nodes use NFS everything is shared, so i need to manually specify
>> x86_64 nodes must use /.../ompi-x86_64 and sparcv9 nodes must use
>> /.../ompi-sparcv9
>>
>> is there a simple way to achieve this ?
>>
>> Cheers,
>>
>> Gilles
>> ___
>> devel mailing list
>> de...@open-mpi.org 
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2015/04/17237.php
>>
>
>


Re: [OMPI devel] running Open MPI with different install paths

2015-04-17 Thread Jeff Squyres (jsquyres)
Will these kinds of things work in all launchers, or just ssh?

I'm a little uncomfortable with going to extraordinary measures for a fairly 
uncommon scenario, especially when there are mechanisms that already exist that 
are designed for this kind of use case (i.e., logic in shell login/startup 
files).

That being said, if there's a core community member who has an itch to scratch 
(e.g., because said core community member runs into this case all the time :-) 
), or if this is poised to become a more common user scenario, then I think 
patches would be gratefully accepted.  :-) :-) :-)



> On Apr 17, 2015, at 8:58 AM, Gilles Gouaillardet 
>  wrote:
> 
> Ralph,
> 
> right now, I am using ssh
> 
> one way to go is to extend the machine file syntax
> instead of
> user@host
> we could have
> user@host:port//prefix
> 
> an other way would be to do this on the command line :
> mpirun --host host1 --prefix prefix1 a.out : -- host host2 --prefix prefix2 
> b.out
> 
> an other really nice feature would be to have a suffix per arch, intel MPI is 
> doing this for mic :
> mpirun --host cpu,mic a.out 
> a.out runs on the Xeon and a.out.mic runs on the Xeon phi
> 
> for the time being, I use a very cheap hack :
> replace orted with a shell scripts that exec the right binary,
> and replace a.out with a wrapper that
> 1. sets LD_LIBRARY_PATH according to the right arch
> since open MPI sets this for the arch on which mpirun is invoked, which might 
> not be the expected one
> 2. invoke the right binary for this arch
> 
> an other really nice feature would be to have mpirun invoke this wrapper 
> "under the hood" :
> 
> mpirun --wrapper wrapper.sh a.out
> would invoke
> wrapper.sh a.out
> on all the nodes, and in this script, I can manually execs a.out from the 
> right path (e.g. one path per arch)
> 
> an alternative to extending the machine file syntax would be to invoke an 
> other wrapper for orted :
> mpirun --orted-wrapper ortedwrapper.sh --wrapper binwrapper.sh a.out
> that would allow the end user to "implement" self discovery
> 
> cheers,
> 
> Gilles 
> 
> On Friday, April 17, 2015, Ralph Castain  wrote:
> Hi Gilles
> 
> What launch environment? We don't currently have a simple way of doing this 
> outside of ensuring the paths on those nodes point to the correct default 
> place (i.e., you can't use prefix). However, it might be possible to add such 
> support if we knew which nodes were what type. Unfortunately, we would need 
> to know that prior to launching the daemons, so we can't self-discover it.
> 
> 
> On Fri, Apr 17, 2015 at 2:32 AM, Gilles Gouaillardet  
> wrote:
> Folks,
> 
> i am trying to run heterogeneous Open MPI.
> all my nodes use NFS everything is shared, so i need to manually specify
> x86_64 nodes must use /.../ompi-x86_64 and sparcv9 nodes must use 
> /.../ompi-sparcv9
> 
> is there a simple way to achieve this ?
> 
> Cheers,
> 
> Gilles
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2015/04/17237.php
> 
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2015/04/17242.php


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] v1.8.5 NEWS and README

2015-04-17 Thread Howard Pritchard
Hi Jeff

Minor Cray corrections below.

On Apr 17, 2015 6:57 AM, "Jeff Squyres (jsquyres)" 
wrote:
>
> The v1.8 branch NEWS, README, and VERSION files have been updated in
preparation for the v1.8.5 release.  Please double check them -- especially
NEWS, particularly to ensure that we are giving credit to users who
submitted bug reports, etc.
>
> Also, please double check that this is a current/correct list of
supported systems:
>
> - The run-time systems that are currently supported are:
>   - rsh / ssh
>   - LoadLeveler
>   - PBS Pro, Torque
>   - Platform LSF (v7.0.2 and later)
>   - SLURM
>   - Cray XT-3, XT-4, and XT-5
delete the line above and replace with:

Cray XE and XK

>   - Oracle Grid Engine (OGE) 6.1, 6.2 and open source Grid Engine
>
> - Systems that have been tested are:
>   - Linux (various flavors/distros), 32 bit, with gcc
>   - Linux (various flavors/distros), 64 bit (x86), with gcc, Absoft,
> Intel, and Portland (*)
>   - OS X (10.6, 10.7, 10.8, 10.9), 32 and 64 bit (x86_64), with gcc and
> Absoft compilers (*)
>
>   (*) Be sure to read the Compiler Notes, below.
>
> - Other systems have been lightly (but not fully tested):
>   - Cygwin 32 & 64 bit with gcc
>   - ARMv4, ARMv5, ARMv6, ARMv7 (when using non-inline assembly; only
> ARMv7 is fully supported when -DOMPI_DISABLE_INLINE_ASM is used).
>   - Other 64 bit platforms (e.g., Linux on PPC64)
>   - Oracle Solaris 10 and 11, 32 and 64 bit (SPARC, i386, x86_64),
> with Oracle Solaris Studio 12.2 and 12.3
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
http://www.open-mpi.org/community/lists/devel/2015/04/17241.php


Re: [OMPI devel] v1.8.5 NEWS and README

2015-04-17 Thread Jeff Squyres (jsquyres)
Howard --

I notice that you have

  - Cray XE and XC

on the master README.

Which is correct for v1.8.5: XC or XK?


> On Apr 17, 2015, at 10:02 AM, Howard Pritchard  wrote:
> 
> Hi Jeff
> 
> Minor cray corrections below
> 
> On Apr 17, 2015 6:57 AM, "Jeff Squyres (jsquyres)"  wrote:
> >
> > The v1.8 branch NEWS, README, and VERSION files have been updated in 
> > preparation for the v1.8.5 release.  Please double check them -- especially 
> > NEWS, particularly to ensure that we are giving credit to users who 
> > submitted bug reports, etc.
> >
> > Also, please double check that this is a current/correct list of supported 
> > systems:
> >
> > - The run-time systems that are currently supported are:
> >   - rsh / ssh
> >   - LoadLeveler
> >   - PBS Pro, Torque
> >   - Platform LSF (v7.0.2 and later)
> >   - SLURM
> >   - Cray XT-3, XT-4, and XT-5
> delete line above replace with
> 
> Cray XE and XK
> 
> >   - Oracle Grid Engine (OGE) 6.1, 6.2 and open source Grid Engine
> >
> > - Systems that have been tested are:
> >   - Linux (various flavors/distros), 32 bit, with gcc
> >   - Linux (various flavors/distros), 64 bit (x86), with gcc, Absoft,
> > Intel, and Portland (*)
> >   - OS X (10.6, 10.7, 10.8, 10.9), 32 and 64 bit (x86_64), with gcc and
> > Absoft compilers (*)
> >
> >   (*) Be sure to read the Compiler Notes, below.
> >
> > - Other systems have been lightly (but not fully tested):
> >   - Cygwin 32 & 64 bit with gcc
> >   - ARMv4, ARMv5, ARMv6, ARMv7 (when using non-inline assembly; only
> > ARMv7 is fully supported when -DOMPI_DISABLE_INLINE_ASM is used).
> >   - Other 64 bit platforms (e.g., Linux on PPC64)
> >   - Oracle Solaris 10 and 11, 32 and 64 bit (SPARC, i386, x86_64),
> > with Oracle Solaris Studio 12.2 and 12.3
> >
> > --
> > Jeff Squyres
> > jsquy...@cisco.com
> > For corporate legal information go to: 
> > http://www.cisco.com/web/about/doing_business/legal/cri/
> >
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post: 
> > http://www.open-mpi.org/community/lists/devel/2015/04/17241.php
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2015/04/17244.php


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



[OMPI devel] mtt failures from last nite

2015-04-17 Thread Howard Pritchard
HI Folks,

I'm seeing build failures on both Carver/PGI at NERSC and on an internal
Cray machine with the nightly build of master.

From the Cray box:

common_ugni.c:30:5: error: 'MCA_BASE_VERSION_2_0_0' undeclared here
(not in a function)
 MCA_BASE_VERSION_2_0_0,
common_ugni.c:31:5: warning: initialization makes integer from pointer
without a cast [enabled by
default]
 "common",
 ^
common_ugni.c:31:5: warning: (near initialization for
'opal_common_ugni_component.mca_minor_version') [enabled by default]
common_ugni.c:31:5: error: initializer element is not computable at load time
common_ugni.c:35:5: warning: initialization makes integer from pointer
without a cast [enabled by
default]
 NULL,
common_ugni.c:35:5: warning: (near initialization for
'opal_common_ugni_component.mca_project_minor_version') [enabled by default]
common_ugni.c:37:1: warning: initialization makes integer from pointer
without a cast [enabled by
default]
)
 ^
make[2]: *** [common_ugni.lo] Error 1
make[2]: *** Waiting for unfinished jobs
make[1]: *** [all-recursive] Error 1
make: *** [all-recursive] Error 1


From the IBM dataplex box:

./../config/depcomp: line 427: atomic-asm.d: No such file or directory
../../config/depcomp: line 430: atomic-asm.d: No such file or directory
pgcc-Fatal-/global/common/carver/usg/pgi/14.4/linux86-64/14.4/bin/pgc
TERMINATED by signal 11
Arguments to /global/common/carver/usg/pgi/14.4/linux86-64/14.4/bin/pgc
 -inform warn -x 119 0xa1 -x 122 0x40 -x 123 0x1000 -x 127 4 -x
127 17 -x 19 0x40 -x 28
0x4 -x 120 0x1000 -x 70 0x8000 -x 122 1 -x 125 0x2 -x 117
0x1000 -quad -x 59 4 -tp
nehalem -astype 0 -stdinc
/global/common/carver/usg/pgi/14.4/linux86-64/14.4/include-gcc41:/global/common/carver/usg/pgi/14.4/linux86-64/14.4/include:/usr/local/include:/usr/lib/gcc/x86_64-redhat-linux/4.1.2/include:/usr/include
-def unix -def __unix -def __unix__ -def linux -def __linux -def
__linux__ -def __NO_MATH_INLINES
-def __x86_64 -def __x86_64__ -def __LONG_MAX__=9223372036854775807L
-def '__SIZE_TYPE__=unsigned
long int' -def '__PTRDIFF_TYPE__=long int' -def __THROW= -def
__extension__= -def __amd_64__amd64__
-def __k8 -def __k8__ -def __SSE__ -def __MMX__ -def __SSE2__ -def
__SSE3__ -def __SSSE3__
-predicate '#machine(x86_64) #lint(off) #system(posix) #cpu(x86_64)'
-idir . -idir
../../../../opal/include -idir ../../../../ompi/include -idir
../../../../oshmem/include -idir
../../../../opal/mca/common/libfabric/libfabric -idir
../../../../opal/mca/hwloc/hwloc191/hwloc/include/private/autogen -idir
../../../../opal/mca/hwloc/hwloc191/hwloc/include/hwloc/autogen -idir
./libevent -idir
./libevent/include -idir ./libevent/include -idir ./libevent/compat
-idir ../../../.. -idir
../../../../orte/include -idir
/global/u2/h/hpp/mtt_carver_tmp/mpi-install/8_8A/src/openmpi-dev-1527-g97444d8/opal/mca/hwloc/hwloc191/hwloc/include
-idir
/global/u2/h/hpp/mtt_carver_tmp/mpi-install/8_8A/src/openmpi-dev-1527-g97444d8/opal/mca/event/libevent2022/libevent
-idir
/global/u2/h/hpp/mtt_carver_tmp/mpi-install/8_8A/src/openmpi-dev-1527-g97444d8/opal/mca/event/libevent2022/libevent/include
-def HAVE_CONFIG_H -def __PIC__ -def PIC -cmdline '+pgcc
libevent2022_component.c -DHAVE_CONFIG_H
-I. -I../../../../opal/include -I../../../../ompi/include
-I../../../../oshmem/include
-I../../../../opal/mca/common/libfabric/libfabric
-I../../../../opal/mca/hwloc/hwloc191/hwloc/include/private/autogen
-I../../../../opal/mca/hwloc/hwloc191/hwloc/include/hwloc/autogen
-I./libevent -I./libevent/include
-I./libevent/include -I./libevent/compat -I../../../..
-I../../../../orte/include
-I/global/u2/h/hpp/mtt_carver_tmp/mpi-install/8_8A/src/openmpi-dev-1527-g97444d8/opal/mca/hwloc/hwloc191/hwloc/include
-I/global/u2/h/hpp/mtt_carver_tmp/mpi-install/8_8A/src/openmpi-dev-1527-g97444d8/opal/mca/event/libevent2022/libevent
-I/global/u2/h/hpp/mtt_carver_tmp/mpi-install/8_8A/src/openmpi-dev-1527-g97444d8/opal/mca/event/libevent2022/libevent/include
-g -c -MD -fpic -DPIC -o .libs/libevent2022_component.o' -outfile
.libs/libevent2022_component.o -x
123 0x8000 -x 123 4 -x 2 0x400 -x 119 0x20 -def __pgnu_vsn=40102
-alwaysinline
/global/common/carver/usg/pgi/14.4/linux86-64/14.4/lib/libintrinsics.il
4 -x 123 8 -x 120 0x20
-x 70 0x4000 -x 163 0x80 -y 189 0x400 -x 62 8 -asm
/global/scratch2/sd/hpp/pgccFq0dD9_z_POK.s
make[3]: *** [libevent2022_component.lo] Error 1
make[2]: *** [all-recursive] Error 1
make[1]: *** [all-recursive] Error 1
make: *** [all-recursive] Error 1


Re: [OMPI devel] v1.8.5 NEWS and README

2015-04-17 Thread Howard Pritchard
Hi Jeff,

Kind of sad, but I don't want to sign up for XC support for 1.8.5.

A Cray XK is just an XE with one Opteron socket per node removed and replaced
with an NVIDIA GPU on a daughter card, so I'm willing to sign up for supporting
that.

So on the master README, say we support
Cray XE, XK, and XC systems.

On the 1.8.5 README, say we support
Cray XE and XK systems.

Actually, the 1.8.X branch will no longer build on Cray owing to PMI issues.
But this late in the release cycle for 1.8.X, I'd prefer not to make more
changes in this area of ORTE.  It's more important that the 1.8.X branch work
well for the SLURM/PMI systems at the tri-labs than for the Crays.

I strongly encourage anyone wanting to use Open MPI on Cray systems
to use master (on good days; today is not such a day) at this point in time.

Sorry for the confusion.

Howard


2015-04-17 8:18 GMT-06:00 Jeff Squyres (jsquyres) :

> Howard --
>
> I notice that you have
>
>   - Cray XE and XC
>
> on the master README.
>
> Which is correct for v1.8.5: XC or XK?
>
>
> > On Apr 17, 2015, at 10:02 AM, Howard Pritchard 
> wrote:
> >
> > Hi Jeff
> >
> > Minor cray corrections below
> >
> > On Apr 17, 2015 6:57 AM, "Jeff Squyres (jsquyres)" 
> wrote:
> > >
> > > The v1.8 branch NEWS, README, and VERSION files have been updated in
> preparation for the v1.8.5 release.  Please double check them -- especially
> NEWS, particularly to ensure that we are giving credit to users who
> submitted bug reports, etc.
> > >
> > > Also, please double check that this is a current/correct list of
> supported systems:
> > >
> > > - The run-time systems that are currently supported are:
> > >   - rsh / ssh
> > >   - LoadLeveler
> > >   - PBS Pro, Torque
> > >   - Platform LSF (v7.0.2 and later)
> > >   - SLURM
> > >   - Cray XT-3, XT-4, and XT-5
> > delete line above replace with
> >
> > Cray XE and XK
> >
> > >   - Oracle Grid Engine (OGE) 6.1, 6.2 and open source Grid Engine
> > >
> > > - Systems that have been tested are:
> > >   - Linux (various flavors/distros), 32 bit, with gcc
> > >   - Linux (various flavors/distros), 64 bit (x86), with gcc, Absoft,
> > > Intel, and Portland (*)
> > >   - OS X (10.6, 10.7, 10.8, 10.9), 32 and 64 bit (x86_64), with gcc and
> > > Absoft compilers (*)
> > >
> > >   (*) Be sure to read the Compiler Notes, below.
> > >
> > > - Other systems have been lightly (but not fully tested):
> > >   - Cygwin 32 & 64 bit with gcc
> > >   - ARMv4, ARMv5, ARMv6, ARMv7 (when using non-inline assembly; only
> > > ARMv7 is fully supported when -DOMPI_DISABLE_INLINE_ASM is used).
> > >   - Other 64 bit platforms (e.g., Linux on PPC64)
> > >   - Oracle Solaris 10 and 11, 32 and 64 bit (SPARC, i386, x86_64),
> > > with Oracle Solaris Studio 12.2 and 12.3
> > >
> > > --
> > > Jeff Squyres
> > > jsquy...@cisco.com
> > > For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> > >
> > > ___
> > > devel mailing list
> > > de...@open-mpi.org
> > > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > > Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/04/17241.php
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/04/17244.php
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/04/17245.php
>


Re: [OMPI devel] Assigning processes to cores 1.4.2, 1.6.4 and 1.8.4

2015-04-17 Thread Tom Wurgler
Ok, seems like I am making some progress here.  Thanks for the help.
I turned HT off.
Now I can run v1.4.2, 1.6.4 and 1.8.4, all compiled with the same compiler and
run on the same machine.
1.4.2 runs this job in 59 minutes.  1.6.4 and 1.8.4 run the job in 1hr 24
minutes.
1.4.2 uses just --mca paffinity-alone 1 and the processes are bound:
  PID  COMMAND  CPUMASK   TOTAL   [     N0      N1  N2  N3  N4  N5 ]
22232  prog1          0  469.9M   [ 469.9M       0   0   0   0   0 ]
22233  prog1          1  479.0M   [   4.0M  475.0M   0   0   0   0 ]
22234  prog1          2  516.7M   [ 516.7M       0   0   0   0   0 ]
22235  prog1          3  485.4M   [   8.0M  477.4M   0   0   0   0 ]
22236  prog1          4  482.6M   [ 482.6M       0   0   0   0   0 ]
22237  prog1          5  486.6M   [   6.0M  480.6M   0   0   0   0 ]
22238  prog1          6  481.3M   [ 481.3M       0   0   0   0   0 ]
22239  prog1          7  419.4M   [   8.0M  411.4M   0   0   0   0 ]

If I use 1.6.4 and 1.8.4 with --mca paffinity-alone 1, the run time is now 1hr 
14 minutes.  The process map now looks like:
bash-4.3# numa-maps -n eagle
  PID  COMMAND  CPUMASK   TOTAL   [     N0     N1  N2  N3  N4  N5 ]
12248  eagle          0  163.3M   [ 155.3M   8.0M   0   0   0   0 ]
12249  eagle          2  161.6M   [ 159.6M   2.0M   0   0   0   0 ]
12250  eagle          4  164.3M   [ 160.3M   4.0M   0   0   0   0 ]
12251  eagle          6  160.4M   [ 156.4M   4.0M   0   0   0   0 ]
12252  eagle          8  160.6M   [ 154.6M   6.0M   0   0   0   0 ]
12253  eagle         10  159.8M   [ 151.8M   8.0M   0   0   0   0 ]
12254  eagle         12  160.9M   [ 152.9M   8.0M   0   0   0   0 ]
12255  eagle         14  159.8M   [ 157.8M   2.0M   0   0   0   0 ]

If I take off the --mca paffinity-alone 1 and instead use --bysocket
--bind-to-core (1.6.4) or --map-by socket --bind-to core (1.8.4), the job runs
in 59 minutes and the process map looks like the 1.4.2 one above... looks super!
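
For reference, these are the three invocations being compared (a sketch only;
the process count and executable name are placeholders, the flags are the ones
quoted above):

-
# 1.4.2-style legacy affinity flag (also accepted by 1.6.4/1.8.4, slower layout):
mpirun --mca paffinity-alone 1 -np 8 ./eagle

# 1.6.4: map round-robin by socket and bind each rank to a core:
mpirun --bysocket --bind-to-core -np 8 ./eagle

# 1.8.4: the equivalent syntax in the 1.8 series:
mpirun --map-by socket --bind-to core -np 8 ./eagle
-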

Now the issue:

If I move the same Open MPI install dirs to our cluster nodes, I can run 1.6.4+
using the --mca paffinity-alone 1 option and the job runs (taking longer, etc.).

If I then try the --bysocket --bind-to-core etc, I get the following error:

--
An attempt to set processor affinity has failed - please check to
ensure that your system supports such functionality. If so, then
this is probably something that should be reported to the OMPI developers.
--
--
mpirun was unable to start the specified application as it encountered an error:

Error name: Input/output error
Node: rdsargo36

when attempting to start process rank 0.
--
Error: Previous command failed (exitcode=1

Now, the original runs were done on an Intel box (and this is where Open MPI was
compiled).
I am now trying to run on an AMD-based cluster node.

So --mca paffinity-alone 1  works
 --bysocket --bind-to-core doesn't.

Does this make sense to you folks?  If the AMD box (running SuSE 11.1, BTW)
doesn't support paffinity, why does the --mca version run?  Is there some way to
check/set whether a system would support --bysocket etc.?  Does it matter which
machine I compiled on?

And compare the following:

[test_lsf2]rds4020[1010]% /apps/share/openmpi/1.6.4.I1404211/bin/ompi_info | 
grep -i affinity
  MPI extensions: affinity example
   MCA paffinity: hwloc (MCA v2.0, API v2.0, Component v1.6.4)
   MCA maffinity: first_use (MCA v2.0, API v2.0, Component v1.6.4)
   MCA maffinity: hwloc (MCA v2.0, API v2.0, Component v1.6.4)

[test_lsf2]rds4020[1010]% /apps/share/openmpi/1.4.2.I1404211/bin/ompi_info | 
grep -i affinity
   MCA paffinity: linux (MCA v2.0, API v2.0, Component v1.4.2)
   MCA maffinity: first_use (MCA v2.0, API v2.0, Component v1.4.2)
   MCA maffinity: libnuma (MCA v2.0, API v2.0, Component v1.4.2)

[test_lsf2]rds4020[1012]% /apps/share/openmpi/1.8.4.I1404211/bin/ompi_info | 
grep -i affinity
(no output)

Shouldn't the 1.8.4 version show something?

Thanks again for the help so far; I'd appreciate any comments/help on the above.
tom

From: devel  on behalf of Ralph Castain 

Sent: Friday, April 10, 2015 11:38 AM
To: Open MPI Developers
Subject: Re: [OMPI devel] Assigning processes to cores 1.4.2, 1.6.4 and 1.8.4

Your configure options look fine.

Getting 1 process assigned to each core (irrespective of HT on or off

Re: [OMPI devel] Assigning processes to cores 1.4.2, 1.6.4 and 1.8.4

2015-04-17 Thread Tom Wurgler
Note where I said "1 hour 14 minutes" it should have read "1 hour 24 minutes"...




From: Tom Wurgler
Sent: Friday, April 17, 2015 2:14 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] Assigning processes to cores 1.4.2, 1.6.4 and 1.8.4

Ok, seems like I am making some progress here.  Thanks for the help.
I turned HT off.
Now I can run v 1.4.2, 1.6.4 and 1.8.4 all compiled the same compiler and run 
on the same machine
1.4.2 runs this job in 59 minutes.   1.6.4 and 1.8.4 run the job in 1hr 24 
minutes.
1.4.2 uses just --mca paffinuty-alone 1 and the processes are bound
  PID  COMMAND  CPUMASK   TOTAL   [     N0      N1  N2  N3  N4  N5 ]
22232  prog1          0  469.9M   [ 469.9M       0   0   0   0   0 ]
22233  prog1          1  479.0M   [   4.0M  475.0M   0   0   0   0 ]
22234  prog1          2  516.7M   [ 516.7M       0   0   0   0   0 ]
22235  prog1          3  485.4M   [   8.0M  477.4M   0   0   0   0 ]
22236  prog1          4  482.6M   [ 482.6M       0   0   0   0   0 ]
22237  prog1          5  486.6M   [   6.0M  480.6M   0   0   0   0 ]
22238  prog1          6  481.3M   [ 481.3M       0   0   0   0   0 ]
22239  prog1          7  419.4M   [   8.0M  411.4M   0   0   0   0 ]

If I use 1.6.4 and 1.8.4 with --mca paffinity-alone 1, the run time is now 1hr 
14 minutes.  The process map now looks like:
bash-4.3# numa-maps -n eagle
  PID  COMMAND  CPUMASK   TOTAL   [     N0     N1  N2  N3  N4  N5 ]
12248  eagle          0  163.3M   [ 155.3M   8.0M   0   0   0   0 ]
12249  eagle          2  161.6M   [ 159.6M   2.0M   0   0   0   0 ]
12250  eagle          4  164.3M   [ 160.3M   4.0M   0   0   0   0 ]
12251  eagle          6  160.4M   [ 156.4M   4.0M   0   0   0   0 ]
12252  eagle          8  160.6M   [ 154.6M   6.0M   0   0   0   0 ]
12253  eagle         10  159.8M   [ 151.8M   8.0M   0   0   0   0 ]
12254  eagle         12  160.9M   [ 152.9M   8.0M   0   0   0   0 ]
12255  eagle         14  159.8M   [ 157.8M   2.0M   0   0   0   0 ]

If I take off the --mca paffinity-alone 1, and instead use --bysocket 
--bind-to-core (1.6.4)  or --map-by socket --bind-to core (1.8.4), the job runs 
in 59 minutes and the process map look like the 1.4.2 one above...looks super!

Now the issue:

If I move the same openmi install dirs to our cluster nodes, I can run 1.64+ 
using the --mca paffinity-alone 1 options and the job runs (taking longer etc).

If I then try the --bysocket --bind-to-core etc, I get the following error:

--
An attempt to set processor affinity has failed - please check to
ensure that your system supports such functionality. If so, then
this is probably something that should be reported to the OMPI developers.
--
--
mpirun was unable to start the specified application as it encountered an error:

Error name: Input/output error
Node: rdsargo36

when attempting to start process rank 0.
--
Error: Previous command failed (exitcode=1

Now the original runs were done on an Intel box (and this is where OpenMPI was 
comilped).
I am trying to run now on an AMD based cluster node.

So --mca paffinity-alone 1  works
 --bysocket --bind-to-core doesn't.

Does this make sense to you folks?  If the AMD (running SuSE 11.1, BTW) doesn't 
support paffinity, why does the --mca version run?  Is there some way to 
check/set whether a system would support --bysocket etc?  Does it matter which 
machine I compiled on?

And compare the following:

[test_lsf2]rds4020[1010]% /apps/share/openmpi/1.6.4.I1404211/bin/ompi_info | 
grep -i affinity
  MPI extensions: affinity example
   MCA paffinity: hwloc (MCA v2.0, API v2.0, Component v1.6.4)
   MCA maffinity: first_use (MCA v2.0, API v2.0, Component v1.6.4)
   MCA maffinity: hwloc (MCA v2.0, API v2.0, Component v1.6.4)

[test_lsf2]rds4020[1010]% /apps/share/openmpi/1.4.2.I1404211/bin/ompi_info | 
grep -i affinity
   MCA paffinity: linux (MCA v2.0, API v2.0, Component v1.4.2)
   MCA maffinity: first_use (MCA v2.0, API v2.0, Component v1.4.2)
   MCA maffinity: libnuma (MCA v2.0, API v2.0, Component v1.4.2)

[test_lsf2]rds4020[1012]% /apps/share/openmpi/1.8.4.I1404211/bin/ompi_info | 
grep -i affinity
(no output)

Shouldn't the 1.8.4 version show something?

Thank again for the help so far and appreciate any comments/help on the above.
tom

From: devel 

Re: [OMPI devel] v1.8.5 NEWS and README

2015-04-17 Thread Paul Hargrove
On Fri, Apr 17, 2015 at 5:57 AM, Jeff Squyres (jsquyres)  wrote:

>   - OS X (10.6, 10.7, 10.8, 10.9), 32 and 64 bit (x86_64), with gcc and
> Absoft compilers (*)
>

Since about 10.7 (depending which XCode you installed), cc and c++ have
been Clang and Clang++ on Mac OS X.
The "gcc" is optional, and also since 10.7 has been "llvm-gcc" and thus not
really the "gcc" we all know (and love?).
Not sure how you want to spin that, if at all (or perhaps you all use a gcc
from fink, brew, etc.).

  - ARMv4, ARMv5, ARMv6, ARMv7 (when using non-inline assembly; only
> ARMv7 is fully supported when -DOMPI_DISABLE_INLINE_ASM is used).
>

Wasn't the inline asm problem fixed?
I ran ARMv5, v6 and v7 tests with 1.8.5rc1 and no special flags, and passed
"make check" w/o problems.


>   - Other 64 bit platforms (e.g., Linux on PPC64)
>

FYI: I've tested 1.8.5rc1 on Linux on ARMv8 (aka AARCH64) exactly once and
it passed (used gcc sync atomics by default).
Not to say that this belongs in the list yet.



>   - Oracle Solaris 10 and 11, 32 and 64 bit (SPARC, i386, x86_64),
> with Oracle Solaris Studio 12.2 and 12.3


I believe Solaris Studio 12.4 now belongs on that list.


-Paul


-- 
Paul H. Hargrove  phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [OMPI devel] v1.8.5 NEWS and README

2015-04-17 Thread Ralph Castain
Guess I’m puzzled by the XC comment on 1.8.5 given that I’m working with at 
least one group that is using it on an XC system. If you don’t want to support 
it, I understand - but we should be clear that it may well work anyway.


> On Apr 17, 2015, at 9:28 AM, Howard Pritchard  wrote:
> 
> Hi Jeff,
> 
> Kind of sad but I don't want to sign up for XC support for 1.8.5.
> 
> Cray XK is just an XE but with one opteron socket/node removed and replaced 
> with an 
> nvidia GPU on a daughter card, so I"m willing to sign up for supporting
> that.
> 
> So on master README say we support 
> Cray XE, XK, and XC systems
> 
> On the 1.8.5 README say we support
> Cray XE and XK systems
> 
> Actually the 1.8.X branch will no longer build on cray owing to pmi issues.
> But this late in the release cycle for 1.8.X I'd prefer not to make more 
> changes
> in this area of orte.  Its more important that the 1.8.X branch work well for
> the slurm/pmi systems at trilabs than for the Cray's.
> 
> I strongly encourage anyone wanting to use open mpi on cray systems
> to use master (on good days, today is not such a day) at this point in time.
> 
> Sorry for the confusion.
> 
> Howard
> 
> 
> 2015-04-17 8:18 GMT-06:00 Jeff Squyres (jsquyres)  >:
> Howard --
> 
> I notice that you have
> 
>   - Cray XE and XC
> 
> on the master README.
> 
> Which is correct for v1.8.5: XC or XK?
> 
> 
> > On Apr 17, 2015, at 10:02 AM, Howard Pritchard  > > wrote:
> >
> > Hi Jeff
> >
> > Minor cray corrections below
> >
> > On Apr 17, 2015 6:57 AM, "Jeff Squyres (jsquyres)"  > > wrote:
> > >
> > > The v1.8 branch NEWS, README, and VERSION files have been updated in 
> > > preparation for the v1.8.5 release.  Please double check them -- 
> > > especially NEWS, particularly to ensure that we are giving credit to 
> > > users who submitted bug reports, etc.
> > >
> > > Also, please double check that this is a current/correct list of 
> > > supported systems:
> > >
> > > - The run-time systems that are currently supported are:
> > >   - rsh / ssh
> > >   - LoadLeveler
> > >   - PBS Pro, Torque
> > >   - Platform LSF (v7.0.2 and later)
> > >   - SLURM
> > >   - Cray XT-3, XT-4, and XT-5
> > delete line above replace with
> >
> > Cray XE and XK
> >
> > >   - Oracle Grid Engine (OGE) 6.1, 6.2 and open source Grid Engine
> > >
> > > - Systems that have been tested are:
> > >   - Linux (various flavors/distros), 32 bit, with gcc
> > >   - Linux (various flavors/distros), 64 bit (x86), with gcc, Absoft,
> > > Intel, and Portland (*)
> > >   - OS X (10.6, 10.7, 10.8, 10.9), 32 and 64 bit (x86_64), with gcc and
> > > Absoft compilers (*)
> > >
> > >   (*) Be sure to read the Compiler Notes, below.
> > >
> > > - Other systems have been lightly (but not fully tested):
> > >   - Cygwin 32 & 64 bit with gcc
> > >   - ARMv4, ARMv5, ARMv6, ARMv7 (when using non-inline assembly; only
> > > ARMv7 is fully supported when -DOMPI_DISABLE_INLINE_ASM is used).
> > >   - Other 64 bit platforms (e.g., Linux on PPC64)
> > >   - Oracle Solaris 10 and 11, 32 and 64 bit (SPARC, i386, x86_64),
> > > with Oracle Solaris Studio 12.2 and 12.3
> > >
> > > --
> > > Jeff Squyres
> > > jsquy...@cisco.com 
> > > For corporate legal information go to: 
> > > http://www.cisco.com/web/about/doing_business/legal/cri/ 
> > > 
> > >
> > > ___
> > > devel mailing list
> > > de...@open-mpi.org 
> > > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel 
> > > 
> > > Link to this post: 
> > > http://www.open-mpi.org/community/lists/devel/2015/04/17241.php 
> > > 
> > ___
> > devel mailing list
> > de...@open-mpi.org 
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel 
> > 
> > Link to this post: 
> > http://www.open-mpi.org/community/lists/devel/2015/04/17244.php 
> > 
> 
> 
> --
> Jeff Squyres
> jsquy...@cisco.com 
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/ 
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org 
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel 
> 
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2015/04/17245.php 
> 
> 
> 

Re: [OMPI devel] interaction with slurm 14.11

2015-04-17 Thread Ralph Castain
Hmmm…but what if a user -doesn’t- want their environment forwarded? Seems 
presumptuous of us to arbitrarily decide to do so on their behalf.


> On Apr 16, 2015, at 7:42 PM, David Singleton  
> wrote:
> 
> 
> Our site effectively runs all slurm jobs with sbatch --export=NONE ...  and 
> creates the necessary environment inside the batch script.  After upgading to 
> 14.11,  OpenMPI mpirun jobs hit
> 
> 2015-04-15T08:53:54+08:00 nod0138 slurmstepd[3122]: error: execve(): orted: 
> No such file or directory
> 
> The issue appears to be that, as of 14.11, srun now recognizes --export=NONE 
> and, more importantly, the SLURM_EXPORT_ENV=NONE set in the jobs environment 
> if you submit with sbatch --export=NONE .   The simple workaround is to unset 
> SLURM_EXPORT_ENV before mpirun.  Possibly mpirun should add --export=ALL to 
> its srun commands.
> 
> Cheers
> David
> 
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2015/04/17236.php



Re: [OMPI devel] v1.8.5 NEWS and README

2015-04-17 Thread Paul Hargrove
On Fri, Apr 17, 2015 at 1:02 PM, Ralph Castain  wrote:
[...regarding Cray XC...]

> If you don't want to support it, I understand - but we should be clear
> that it may well work anyway.



Ralph,

Do you really want to enumerate all of the "it may well work anyway"
platforms?
If so, I have quite a list of them to offer :-)

Seriously, though, it sounds from Howard's comments like 1.8.5 on Cray XC
qualifies for some status like the "lightly (but not fully) tested"
platforms.

-Paul

-- 
Paul H. Hargrove  phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [OMPI devel] v1.8.5 NEWS and README

2015-04-17 Thread Ralph Castain
That’s where I was suggesting it be put, as opposed to just removing it entirely.

> On Apr 17, 2015, at 1:16 PM, Paul Hargrove  wrote:
> 
> 
> On Fri, Apr 17, 2015 at 1:02 PM, Ralph Castain  > wrote:
> [...regarding Cray XC...]
> If you don’t want to support it, I understand - but we should be clear that 
> it may well work anyway.
> 
> 
> Ralph,
> 
> Do you really want to enumerate all of the "it may well work anyway" 
> platforms?
> If so, I have quite a list of them to offer :-)
> 
> Seriously, though, it sounds from Howard's comments like 1.8.5 on Cray XC 
> qualifies for some status like the "lightly (but not fully) tested" platforms.
> 
> -Paul
> 
> -- 
> Paul H. Hargrove  phhargr...@lbl.gov 
> 
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department   Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2015/04/17253.php



Re: [OMPI devel] interaction with slurm 14.11

2015-04-17 Thread Paul Hargrove
Ralph,

I think David's concern is that because Slurm has changed their default
behavior, Open MPI's default behavior has changed as well.
The request (on which I have no opinion) appears to be that ORTE make an
explicit request for the behavior that was the previous default in Slurm.
That would ensure that the behavior of Open MPI remains independent of the
Slurm version.

David,

The problem here appears to be that the new (--export=NONE) behavior means
that $PATH and/or $LD_LIBRARY_PATH are not propagated, and thus orted could
not be found.
I believe you can configure Open MPI with --enable-mpirun-prefix-by-default
to resolve the reported "orted: No such file or directory"
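
For concreteness, the simple workaround from the original report, plus the
configure option suggested above, look like this (the install path is a
placeholder; see the follow-up later in the thread on whether the configure
option actually helps under Slurm):

-
# in the sbatch script, before invoking mpirun:
unset SLURM_EXPORT_ENV

# or, at build time:
./configure --prefix=/path/to/openmpi --enable-mpirun-prefix-by-default
-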

-Paul

On Fri, Apr 17, 2015 at 1:13 PM, Ralph Castain  wrote:

> Hmmm...but what if a user -doesn't- want their environment forwarded? Seems
> presumptuous of us to arbitrarily decide to do so on their behalf.
>
>
> > On Apr 16, 2015, at 7:42 PM, David Singleton <
> david.b.single...@gmail.com> wrote:
> >
> >
> > Our site effectively runs all slurm jobs with sbatch --export=NONE ...
> and creates the necessary environment inside the batch script.  After
> upgading to 14.11,  OpenMPI mpirun jobs hit
> >
> > 2015-04-15T08:53:54+08:00 nod0138 slurmstepd[3122]: error: execve():
> orted: No such file or directory
> >
> > The issue appears to be that, as of 14.11, srun now recognizes
> --export=NONE and, more importantly, the SLURM_EXPORT_ENV=NONE set in the
> jobs environment if you submit with sbatch --export=NONE .   The simple
> workaround is to unset SLURM_EXPORT_ENV before mpirun.  Possibly mpirun
> should add --export=ALL to its srun commands.
> >
> > Cheers
> > David
> >
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/04/17236.php
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/04/17252.php




-- 
Paul H. Hargrove  phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [OMPI devel] v1.8.5 NEWS and README

2015-04-17 Thread Howard Pritchard
Right on Paul!

I can certainly get 1.8.5 to "work" on Cray systems like Hopper, but last
time I tried it out of the box I had to fix up the PMI code in
ess_pmi_module.c, because with recent Cray PMIs (like the ones now default
on Hopper) the configure ends up resulting in the use of PMI_KVS_Put/Get,
which, for reasons too complex to go into here, Cray only allows to work
when an app is launched by Hydra.  That was only the first of the problems.
It might be that someone has gotten it to work by linking against an
ancient version of Cray PMI, etc.

Again, I don't want to try to patch up the 1.8 release to work on the Cray
systems.  Way too late in the game for that branch.

But don't worry: for cory, things will be running great using 1.9/2.0.
Of course, maybe for cory, beyond-MPI programming models might be more
appropriate.

Howard




2015-04-17 14:16 GMT-06:00 Paul Hargrove :

>
> On Fri, Apr 17, 2015 at 1:02 PM, Ralph Castain  wrote:
> [...regarding Cray XC...]
>
>> If you don’t want to support it, I understand - but we should be clear
>> that it may well work anyway.
>
>
>
> Ralph,
>
> Do you really want to enumerate all of the "it may well work anyway"
> platforms?
> If so, I have quite a list of them to offer :-)
>
> Seriously, though, it sounds from Howard's comments like 1.8.5 on Cray XC
> qualifies for some status like the "lightly (but not fully) tested"
> platforms.
>
> -Paul
>
> --
> Paul H. Hargrove  phhargr...@lbl.gov
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department   Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/04/17253.php
>


Re: [OMPI devel] interaction with slurm 14.11

2015-04-17 Thread Ralph Castain

> On Apr 17, 2015, at 1:27 PM, Paul Hargrove  wrote:
> 
> Ralph,
> 
> I think David's concern is that because Slurm has changed their default 
> behavior, Open MPI's default behavior has changed as well.
> The request (on which I have no opinion) appears to be that ORTE make an 
> explicit request for the behavior that was the previous default in Slurm.
> That would ensure that the behavior of Open MPI remains independent of the 
> Slurm version.

Understood. However, my comment stands. Open MPI’s behavior has not changed at 
all - Slurm’s has. I’m willing to give them their due that they made this 
decision for a valid reason, and I suspect communicated it in their NEWS in the 
release. I’m not about to override their settings under the covers.

As noted below and in the initial note, the users and sys admins can easily 
obtain the desired behavior by changing Slurm settings or using appropriate 
OMPI configure and/or cmd line options. I see no reason for making a change to 
the code.

HTH
Ralph

> 
> David,
> 
> The problem here appears to be that the new (--export=NONE) behavior means 
> that $PATH and/or $LD_LIBRARY_PATH are not propagated, and thus orted could 
> not be found.
> I believe you can configure Open MPI with --enable-mpirun-prefix-by-default 
> to resolve the reported "orted: No such file or directory"
> 
> -Paul
> 
> On Fri, Apr 17, 2015 at 1:13 PM, Ralph Castain  > wrote:
> Hmmm…but what if a user -doesn’t- want their environment forwarded? Seems 
> presumptuous of us to arbitrarily decide to do so on their behalf.
> 
> 
> > On Apr 16, 2015, at 7:42 PM, David Singleton  > > wrote:
> >
> >
> > Our site effectively runs all slurm jobs with sbatch --export=NONE ...  and 
> > creates the necessary environment inside the batch script.  After upgading 
> > to 14.11,  OpenMPI mpirun jobs hit
> >
> > 2015-04-15T08:53:54+08:00 nod0138 slurmstepd[3122]: error: execve(): orted: 
> > No such file or directory
> >
> > The issue appears to be that, as of 14.11, srun now recognizes 
> > --export=NONE and, more importantly, the SLURM_EXPORT_ENV=NONE set in the 
> > jobs environment if you submit with sbatch --export=NONE .   The simple 
> > workaround is to unset SLURM_EXPORT_ENV before mpirun.  Possibly mpirun 
> > should add --export=ALL to its srun commands.
> >
> > Cheers
> > David
> >
> > ___
> > devel mailing list
> > de...@open-mpi.org 
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel 
> > 
> > Link to this post: 
> > http://www.open-mpi.org/community/lists/devel/2015/04/17236.php 
> > 
> 
> ___
> devel mailing list
> de...@open-mpi.org 
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel 
> 
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2015/04/17252.php 
> 
> 
> 
> -- 
> Paul H. Hargrove  phhargr...@lbl.gov 
> 
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department   Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2015/04/17255.php



Re: [OMPI devel] v1.8.5 NEWS and README

2015-04-17 Thread Jeff Squyres (jsquyres)
On Apr 17, 2015, at 2:19 PM, Paul Hargrove  wrote:
> 
>   - OS X (10.6, 10.7, 10.8, 10.9), 32 and 64 bit (x86_64), with gcc and
> Absoft compilers (*)
> 
> Since about 10.7 (depending which XCode you installed), cc and c++ have been 
> Clang and Clang++ on Mac OS X.  ...snip

Good point:

  - OS X (10.6, 10.7, 10.8, 10.9), 32 and 64 bit (x86_64), with XCode
and Absoft compilers (*)

>   - ARMv4, ARMv5, ARMv6, ARMv7 (when using non-inline assembly; only
> ARMv7 is fully supported when -DOMPI_DISABLE_INLINE_ASM is used).
> 
> Wasn't the inline asm problem fixed?
> I ran ARMv5, v6 and v7 tests with 1.8.5rc1 and no special flags, and passed 
> "make check" w/o problems.

Will remove it.

>  
>   - Other 64 bit platforms (e.g., Linux on PPC64)
> 
> FYI: I've tested 1.8.5rc1 on Linux on ARMv8 (aka AARCH64) exactly once and it 
> passed (used gcc sync atomics by default).
> Not to say that this belongs in the list yet.
> 
>  
>   - Oracle Solaris 10 and 11, 32 and 64 bit (SPARC, i386, x86_64),
> with Oracle Solaris Studio 12.2 and 12.3
> 
> I believe Solaris Studio 12.4 now belongs on that list.

Done.

Thanks.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] v1.8.5 NEWS and README

2015-04-17 Thread Jeff Squyres (jsquyres)
On Apr 17, 2015, at 6:28 PM, Jeff Squyres (jsquyres)  wrote:
> 
>  - OS X (10.6, 10.7, 10.8, 10.9), 32 and 64 bit (x86_64), with XCode
>and Absoft compilers (*)

Actually, we should include 10.10 in there.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] interaction with slurm 14.11

2015-04-17 Thread David Singleton
On Sat, Apr 18, 2015 at 6:27 AM, Paul Hargrove  wrote:

>
> The problem here appears to be that the new (--export=NONE) behavior means
> that $PATH and/or $LD_LIBRARY_PATH are not propagated, and thus orted could
> not be found.
> I believe you can configure Open MPI with
> --enable-mpirun-prefix-by-default to resolve the reported "orted: No such
> file or directory"
>

It looks like --prefix, --enable-mpirun-prefix-by-default and fullpath
mpirun are all handled the same way - they just add to PATH and
LD_LIBRARY_PATH.  But since they are not propagated, this doesn't help.

David


Re: [OMPI devel] interaction with slurm 14.11

2015-04-17 Thread Ralph Castain

> On Apr 17, 2015, at 3:54 PM, David Singleton  
> wrote:
> 
> 
> 
> On Sat, Apr 18, 2015 at 6:27 AM, Paul Hargrove  > wrote:
> 
> The problem here appears to be that the new (--export=NONE) behavior means 
> that $PATH and/or $LD_LIBRARY_PATH are not propagated, and thus orted could 
> not be found.
> I believe you can configure Open MPI with --enable-mpirun-prefix-by-default 
> to resolve the reported "orted: No such file or directory"
> 
> It looks like --prefix, --enable-mpirun-prefix-by-default and fullpath mpirun 
> are all handled the same way - they just add to PATH and LD_LIBRARY_PATH.  
> But since they are not propagated, this doesn't help.

This is correct - they only get propagated in rsh/ssh environments as we then 
have a method for doing so. We can’t propagate them in Slurm - you either have 
to add the paths to your shell startup script, or ask Slurm to forward them for 
you.


> 
> David
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2015/04/17260.php



Re: [OMPI devel] Assigning processes to cores 1.4.2, 1.6.4 and 1.8.4

2015-04-17 Thread Ralph Castain
Hi Tom

Glad you are making some progress! Note that the 1.8 series uses hwloc for its 
affinity operations, while the 1.4 and 1.6 series used the old plpa code. 
Hence, you will not find the “affinity” components in the 1.8 ompi_info output.

Is there some reason you didn’t compile OMPI on the AMD machine? I ask because 
there are some config switches in various areas that differ between AMD and 
Intel architectures.
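
As for checking whether binding works at all on a given node: one quick sanity
check is to try it with the hwloc command-line utilities directly (a sketch;
assumes hwloc's tools are installed on that node):

-
# show the topology hwloc detects on this node (sockets, cores, PUs)
hwloc-ls

# try binding to socket 0, core 0, and print the resulting cpuset
hwloc-bind socket:0.core:0 -- hwloc-bind --get
-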


> On Apr 17, 2015, at 11:16 AM, Tom Wurgler  wrote:
> 
> Note where I said "1 hour 14 minutes" it should have read "1 hour 24 
> minutes"...
> 
> 
> 
> From: Tom Wurgler
> Sent: Friday, April 17, 2015 2:14 PM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] Assigning processes to cores 1.4.2, 1.6.4 and 1.8.4
>  
> Ok, seems like I am making some progress here.  Thanks for the help.
> I turned HT off.
> Now I can run v 1.4.2, 1.6.4 and 1.8.4 all compiled the same compiler and run 
> on the same machine
> 1.4.2 runs this job in 59 minutes.   1.6.4 and 1.8.4 run the job in 1hr 24 
> minutes.
> 1.4.2 uses just --mca paffinuty-alone 1 and the processes are bound
>   PID COMMAND CPUMASK TOTAL [ N0 N1 N2 N3 N4  
>N5 ]
> 22232 prog1 0469.9M [ 469.9M 0  0  0  0   
>0  ]
> 22233 prog1 1479.0M [   4.0M 475.0M 0  0  0   
>0  ]
> 22234 prog1 2516.7M [ 516.7M 0  0  0  0   
>0  ]
> 22235 prog1 3485.4M [   8.0M 477.4M 0  0  0   
>0  ]
> 22236 prog1 4482.6M [ 482.6M 0  0  0  0   
>0  ]
> 22237 prog1 5486.6M [   6.0M 480.6M 0  0  0   
>0  ]
> 22238 prog1 6481.3M [ 481.3M 0  0  0  0   
>0  ]
> 22239 prog1 7419.4M [   8.0M 411.4M 0  0  0   
>0  ]
> 
> If I use 1.6.4 and 1.8.4 with --mca paffinity-alone 1, the run time is now 
> 1hr 14 minutes.  The process map now looks like:
> bash-4.3# numa-maps -n eagle
>   PID COMMAND CPUMASK TOTAL [ N0 N1 N2 N3 N4  
>N5 ]
> 12248 eagle 0163.3M [ 155.3M   8.0M 0  0  0   
>0  ]
> 12249 eagle 2161.6M [ 159.6M   2.0M 0  0  0   
>0  ]
> 12250 eagle 4164.3M [ 160.3M   4.0M 0  0  0   
>0  ]
> 12251 eagle 6160.4M [ 156.4M   4.0M 0  0  0   
>0  ]
> 12252 eagle 8160.6M [ 154.6M   6.0M 0  0  0   
>0  ]
> 12253 eagle10159.8M [ 151.8M   8.0M 0  0  0   
>0  ]
> 12254 eagle12160.9M [ 152.9M   8.0M 0  0  0   
>0  ]
> 12255 eagle14159.8M [ 157.8M   2.0M 0  0  0   
>0  ]
> 
> If I take off the --mca paffinity-alone 1, and instead use --bysocket 
> --bind-to-core (1.6.4)  or --map-by socket --bind-to core (1.8.4), the job 
> runs in 59 minutes and the process map look like the 1.4.2 one above...looks 
> super!
> 
> Now the issue:
> 
> If I move the same openmi install dirs to our cluster nodes, I can run 1.64+ 
> using the --mca paffinity-alone 1 options and the job runs (taking longer 
> etc).
> 
> If I then try the --bysocket --bind-to-core etc, I get the following error:
> 
> --
> An attempt to set processor affinity has failed - please check to
> ensure that your system supports such functionality. If so, then
> this is probably something that should be reported to the OMPI developers.
> --
> --
> mpirun was unable to start the specified application as it encountered an 
> error:
> 
> Error name: Input/output error
> Node: rdsargo36
> 
> when attempting to start process rank 0.
> --
> Error: Previous command failed (exitcode=1
> 
> Now the original runs were done on an Intel box (and this is where OpenMPI 
> was comilped).
> I am trying to run now on an AMD based cluster node.
> 
> So --mca paffinity-alone 1  works
>  --bysocket --bind-to-core doesn't.
> 
> Does this make sense to you folks?  If the AMD (running SuSE 11.1, BTW) 
> doesn't support paffinity, why does the --mca version run?  Is there some way 
> to check/set whether a system would support --bysocket etc?  Does it matter 
> which machine I compiled on?
> 
> And compare the following:
> 
> [test_lsf2]rds4020[1010]% /apps/share/openmpi/1.6.4.I1404211/bin/ompi_info | 
> grep -i affinity
>   MPI extensions: affinity example
>MCA paffinity: hwloc (MCA v2.0, API v2.0, Component v1.6.4)
>MCA maffinity: first_use (MCA v2.0, API v2.0, Component v1.6.4)
>