Re: [OMPI users] OpenMPI 4.1.1, CentOS 7.9, nVidia HPC-SDk, build hints?

2021-09-30 Thread Ray Muno via users
/NV_hpc_sdk/Linux_x86_64/21.7/compilers/bin/nvc
  C compiler family name: PGI
  C compiler version: 21.7-0
C++ compiler: nvc++
   C++ compiler absolute: 
/stage/opt/NV_hpc_sdk/Linux_x86_64/21.7/compilers/bin/nvc++
   Fort compiler: nvfortran
   Fort compiler abs: 
/stage/opt/NV_hpc_sdk/Linux_x86_64/21.7/compilers/bin/nvfortran

NV-HPC Version 21.9
+++

ompi_info
 Package: Open MPI muno@loki.local Distribution
Open MPI: 4.1.1
  Open MPI repo revision: v4.1.1
   Open MPI release date: Apr 24, 2021
Open RTE: 4.1.1
  Open RTE repo revision: v4.1.1
   Open RTE release date: Apr 24, 2021
OPAL: 4.1.1
  OPAL repo revision: v4.1.1
   OPAL release date: Apr 24, 2021
 MPI API: 3.1.0
Ident string: 4.1.1
  Prefix: /stage/opt/OpenMPI/ROME/4.1.1/NV-HPC/21.9
 Configured architecture: x86_64-pc-linux-gnu
  Configure host: loki.local
   Configured by: muno
   Configured on: Thu Sep 30 17:59:36 UTC 2021
  Configure host: loki.local
  Configure command line: 'CC=nvc' 'CXX=nvc++' 'FC=nvfortran' 'FCFLAGS=-fPIC'
  '--enable-mca-no-build=op-avx'
  '--prefix=/stage/opt/OpenMPI/ROME/4.1.1/NV-HPC/21.9'
  '--with-libevent=internal'
  '--enable-mpi1-compatibility' '--without-xpmem'
  '--with-pmi' '--enable-mpi-cxx'
  '--with-hwloc=/stage/opt/HWLOC/2.5.0'
  '--with-hcoll=/opt/mellanox/hcoll'
  '--with-knem=/opt/knem-1.1.4.90mlnx1'
  
'--with-cuda=/stage/opt/NV_hpc_sdk/Linux_x86_64/21.7/cuda'
Built by: muno
Built on: Thu Sep 30 18:17:02 UTC 2021
  Built host: loki.local
  C bindings: yes
C++ bindings: yes
 Fort mpif.h: yes (all)
Fort use mpi: yes (full: ignore TKR)
   Fort use mpi size: deprecated-ompi-info-value
Fort use mpi_f08: yes
 Fort mpi_f08 compliance: The mpi_f08 module is available, but due to
  limitations in the nvfortran compiler and/or Open
  MPI, does not support the following: array
  subsections, direct passthru (where possible) to
  underlying Open MPI's C functionality
  Fort mpi_f08 subarrays: no
   Java bindings: no
  Wrapper compiler rpath: runpath
  C compiler: nvc
 C compiler absolute: 
/stage/opt/NV_hpc_sdk/Linux_x86_64/21.9/compilers/bin/nvc
  C compiler family name: PGI
  C compiler version: 21.9-0
C++ compiler: nvc++
   C++ compiler absolute: 
/stage/opt/NV_hpc_sdk/Linux_x86_64/21.9/compilers/bin/nvc++
   Fort compiler: nvfortran

On 9/30/21 12:18 PM, Ray Muno via users wrote:

OK, starting clean.

OS CentOS 7.9  (7.9.2009)
mlnxofed 5.4-1.0.3.0
UCX 1.11.0 (from mlnxofed)
hcoll-4.7.3199 (from mlnxofed)
knem-1.1.4.90  (from mlnxofed)
nVidia HPC-SDK  21.9
OpenMPI 4.1.1
HWLOC 2.5.0

Straight configure

configure CC=nvc CXX=nvc++  FC=nvfortran

dies in

  FCLD libmpi_usempif08.la
/usr/bin/ld: .libs/comm_spawn_multiple_f08.o: relocation R_X86_64_32S against `.rodata' can not be 
used when making a shared object; recompile with -fPIC


configure CC=nvc CXX=nvc++  FC=nvfortran FCFLAGS='-fPIC'

Fixes that, dies in

  CCLD mca_op_avx.la
./.libs/liblocal_ops_avx512.a(liblocal_ops_avx512_la-op_avx_functions.o):(.data+0x0): multiple 
definition of `ompi_op_avx_functions_avx2'

./.libs/liblocal_ops_avx2.a(liblocal_ops_avx2_la-op_avx_functions.o):(.data+0x0):
 first defined here

configure CC=nvc CXX=nvc++ FC=nvfortran FCFLAGS=-fPIC 
--enable-mca-no-build=op-avx

succeeds

And, working up to what I really want, I can build (somewhat emulating the 
HPC-X 2.9.0 build)

/project/muno/OpenMPI/BUILD/SRC/openmpi-4.1.1/configure CC=nvc CXX=nvc++ FC=nvfortran FCFLAGS=-fPIC 
--enable-mca-no-build=op-avx --prefix=/stage/opt/OpenMPI/ROME/4.1.1/NV-HPC/21.9 
--with-libevent=internal --enable-mpi1-compatibility --without-xpmem --with-pmi --enable-mpi-cxx 
--with-hwloc=/stage/opt/HWLOC/2.5.0 --with-hcoll=/opt/mellanox/hcoll 
--with-knem=/opt/knem-1.1.4.90mlnx1 --with-cuda=/stage/opt/NV_hpc_sdk/Linux_x86_64/21.9/cuda


adding

--with-platform=/project/muno/OpenMPI/BUILD/SRC/openmpi-4.1.1/contrib/platform/mellanox/optimized

brings back

/usr/bin/ld: final link failed: Bad value
make[2]: *** [libmpi_usempif08.la] Error 2

Apparently it overrides the FCFLAGS command-line setting.
Editing the platform file to add -fPIC to FCFLAGS took care of that, as did FC='nvfortran -fPIC' (which is 
kludgey).



-Ray Muno

On 9/30/21 8:13 AM, Gilles Gouaillardet via users wrote:

  Ray,

there is a typo, the configure option is

--enable-mca-no-build=op-avx

Cheers,

Gilles

- Original Message

Re: [OMPI users] OpenMPI 4.1.1, CentOS 7.9, nVidia HPC-SDk, build hints?

2021-09-30 Thread Ray Muno via users

OK, starting clean.

OS CentOS 7.9  (7.9.2009)
mlnxofed 5.4-1.0.3.0
UCX 1.11.0 (from mlnxofed)
hcoll-4.7.3199 (from mlnxofed)
knem-1.1.4.90  (from mlnxofed)
nVidia HPC-SDK  21.9
OpenMPI 4.1.1
HWLOC 2.5.0

Straight configure

configure CC=nvc CXX=nvc++  FC=nvfortran

dies in

 FCLD libmpi_usempif08.la
/usr/bin/ld: .libs/comm_spawn_multiple_f08.o: relocation R_X86_64_32S against `.rodata' can not be 
used when making a shared object; recompile with -fPIC


configure CC=nvc CXX=nvc++  FC=nvfortran FCFLAGS='-fPIC'

Fixes that, dies in

 CCLD mca_op_avx.la
./.libs/liblocal_ops_avx512.a(liblocal_ops_avx512_la-op_avx_functions.o):(.data+0x0): multiple 
definition of `ompi_op_avx_functions_avx2'

./.libs/liblocal_ops_avx2.a(liblocal_ops_avx2_la-op_avx_functions.o):(.data+0x0):
 first defined here

configure CC=nvc CXX=nvc++ FC=nvfortran FCFLAGS=-fPIC 
--enable-mca-no-build=op-avx

succeeds

And, working up to what I really want, I can build (somewhat emulating the 
HPC-X 2.9.0 build)

/project/muno/OpenMPI/BUILD/SRC/openmpi-4.1.1/configure CC=nvc CXX=nvc++ FC=nvfortran FCFLAGS=-fPIC 
--enable-mca-no-build=op-avx --prefix=/stage/opt/OpenMPI/ROME/4.1.1/NV-HPC/21.9 
--with-libevent=internal --enable-mpi1-compatibility --without-xpmem --with-pmi --enable-mpi-cxx 
--with-hwloc=/stage/opt/HWLOC/2.5.0 --with-hcoll=/opt/mellanox/hcoll 
--with-knem=/opt/knem-1.1.4.90mlnx1 --with-cuda=/stage/opt/NV_hpc_sdk/Linux_x86_64/21.9/cuda


adding

--with-platform=/project/muno/OpenMPI/BUILD/SRC/openmpi-4.1.1/contrib/platform/mellanox/optimized

brings back

/usr/bin/ld: final link failed: Bad value
make[2]: *** [libmpi_usempif08.la] Error 2

Apparently it overrides the FCFLAGS command-line setting.
Editing the platform file to add -fPIC to FCFLAGS took care of that, as did FC='nvfortran -fPIC' (which is 
kludgey).
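
For anyone following along, that edit amounts to appending something like the lines below to the
platform file. This is only a sketch: I'm assuming the mellanox/optimized platform file is a plain
shell fragment that configure sources, so a trailing FCFLAGS assignment takes effect.

   # appended to contrib/platform/mellanox/optimized (assuming the file is sourced as shell)
   FCFLAGS="-fPIC"
   # or, to keep anything the file already sets: FCFLAGS="$FCFLAGS -fPIC"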



-Ray Muno

On 9/30/21 8:13 AM, Gilles Gouaillardet via users wrote:

  Ray,

there is a typo, the configure option is

--enable-mca-no-build=op-avx

Cheers,

Gilles

- Original Message -

  Added --enable-mca-no-build=op-avx to the configure line. Still dies in 
the same place.
history | grep config




CCLD mca_op_avx.la

./.libs/liblocal_ops_avx512.a(liblocal_ops_avx512_la-op_avx_functions.o):(.data+0x0):
 multiple
definition of `ompi_op_avx_functions_avx2'

./.libs/liblocal_ops_avx2.a(liblocal_ops_avx2_la-op_avx_functions.o):(.data+0x0):
 first defined here
./.libs/liblocal_ops_avx512.a(liblocal_ops_avx512_la-op_avx_functions.o): 
In function
`ompi_op_avx_2buff_min_uint16_t_avx2':

/project/muno/OpenMPI/BUILD/SRC/openmpi-4.1.1/ompi/mca/op/avx/op_avx_functions.c:651:
 multiple
definition of `ompi_op_avx_3buff_functions_avx2'

./.libs/liblocal_ops_avx2.a(liblocal_ops_avx2_la-op_avx_functions.o):/project/muno/OpenMPI/BUILD/SRC/openmpi-4.1.1/ompi/mca/op/avx/op_avx_functions.c:651:
first defined here
make[2]: *** [mca_op_avx.la] Error 2
make[2]: Leaving directory 
`/project/muno/OpenMPI/BUILD/4.1.1/ROME/NV-HPC/21.9/ompi/mca/op/avx'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory 
`/project/muno/OpenMPI/BUILD/4.1.1/ROME/NV-HPC/21.9/ompi'
make: *** [all-recursive] Error 1

On 9/30/21 5:54 AM, Carl Ponder wrote:


For now, you can suppress this error building OpenMPI 4.1.1


./.libs/liblocal_ops_avx512.a(liblocal_ops_avx512_la-op_avx_functions.o):(.data+0x0):
multiple definition of `ompi_op_avx_functions_avx2'

./.libs/liblocal_ops_avx2.a(liblocal_ops_avx2_la-op_avx_functions.o):(.data+0x0):
 first
defined here

./.libs/liblocal_ops_avx512.a(liblocal_ops_avx512_la-op_avx_functions.o): In 
function
`ompi_op_avx_2buff_min_uint16_t_avx2':

/project/muno/OpenMPI/BUILD/SRC/openmpi-4.1.1/ompi/mca/op/avx/op_avx_functions.c:651:
multiple definition of `ompi_op_avx_3buff_functions_avx2'

./.libs/liblocal_ops_avx2.a(liblocal_ops_avx2_la-op_avx_functions.o):/project/muno/OpenMPI/BUILD/SRC/openmpi-4.1.1/ompi/mca/op/avx/op_avx_functions.c:651:

first defined here 


with the NVHPC/PGI 21.9 compiler by using the setting

configure --enable-mca-no-build=op-avx ...

We're still looking at the cause here. I don't have any advice about the 
problem with 21.7.



Subject:Re: [OMPI users] OpenMPI 4.1.1, CentOS 7.9, nVidia HPC-SDk, 
build hints?
Date:   Wed, 29 Sep 2021 12:25:43 -0500
From:   Ray Muno via users 
Reply-To:   Open MPI Users 
To: users@lists.open-mpi.org
CC: Ray Muno 





Tried this

configure CC='nvc -fPIC' CXX='nvc++ -fPIC' FC='nvfortran -fPIC'

Configure completes. Compiles quite a way through. Dies in a different 
place. It does get past the
first error, however with libmpi_usempif08.la

Re: [OMPI users] OpenMPI 4.1.1, CentOS 7.9, nVidia HPC-SDk, build hints?

2021-09-29 Thread Ray Muno via users



Tried this

configure   CC='nvc -fPIC' CXX='nvc++ -fPIC' FC='nvfortran -fPIC'

Configure completes. Compiles quite a way through. Dies in a different place. It does get past the 
first error, however with libmpi_usempif08.la



  FCLD libmpi_usempif08.la
make[2]: Leaving directory 
`/project/muno/OpenMPI/BUILD/4.1.1/ROME/NV-HPC/21.9/ompi/mpi/fortran/use-mpi-f08'

Making all in mpi/fortran/mpiext-use-mpi-f08
make[2]: Entering directory 
`/project/muno/OpenMPI/BUILD/4.1.1/ROME/NV-HPC/21.9/ompi/mpi/fortran/mpiext-use-mpi-f08'

  PPFC mpi-f08-ext-module.lo
  FCLD libforce_usempif08_module_to_be_built.la
make[2]: Leaving directory 
`/project/muno/OpenMPI/BUILD/4.1.1/ROME/NV-HPC/21.9/ompi/mpi/fortran/mpiext-use-mpi-f08'


Dies here now.

 CCLD liblocal_ops_avx512.la
  CCLD mca_op_avx.la
./.libs/liblocal_ops_avx512.a(liblocal_ops_avx512_la-op_avx_functions.o):(.data+0x0): multiple 
definition of `ompi_op_avx_functions_avx2'

./.libs/liblocal_ops_avx2.a(liblocal_ops_avx2_la-op_avx_functions.o):(.data+0x0):
 first defined here
./.libs/liblocal_ops_avx512.a(liblocal_ops_avx512_la-op_avx_functions.o): In function 
`ompi_op_avx_2buff_min_uint16_t_avx2':
/project/muno/OpenMPI/BUILD/SRC/openmpi-4.1.1/ompi/mca/op/avx/op_avx_functions.c:651: multiple 
definition of `ompi_op_avx_3buff_functions_avx2'
./.libs/liblocal_ops_avx2.a(liblocal_ops_avx2_la-op_avx_functions.o):/project/muno/OpenMPI/BUILD/SRC/openmpi-4.1.1/ompi/mca/op/avx/op_avx_functions.c:651: 
first defined here

make[2]: *** [mca_op_avx.la] Error 2
make[2]: Leaving directory 
`/project/muno/OpenMPI/BUILD/4.1.1/ROME/NV-HPC/21.9/ompi/mca/op/avx'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory 
`/project/muno/OpenMPI/BUILD/4.1.1/ROME/NV-HPC/21.9/ompi'
make: *** [all-recursive] Error 1


On 9/29/21 11:42 AM, Bennet Fauber via users wrote:

Ray,

If all the errors about not being compiled with -fPIC are still appearing, there may be a bug that 
is preventing the option from getting through to the compiler(s).  It might be worth looking through 
the logs to see the full compile command for one or more of them to see whether that is true?  Say, 
libs/comm_spawn_multiple_f08.o for example?


If -fPIC is missing, you may be able to recompile that manually with the -fPIC in place, then remake 
and see if that also causes the link error to go away, that would be a good start.
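
One way to do that, as a sketch (the build.log name is just a placeholder, and V=1 is the automake
switch that turns off the silent rules so the full command lines are printed):

   make V=1 2>&1 | tee build.log
   grep 'comm_spawn_multiple_f08' build.log
   # inspect the nvfortran line that comes back and check whether -fPIC is actually on it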


Hope this helps,    -- bennet



On Wed, Sep 29, 2021 at 12:29 PM Ray Muno via users <users@lists.open-mpi.org> wrote:


I did try that and it fails at the same place.

Which version of the nVidia HPC-SDK are you using? I am using 21.7.  I see 
there is an upgrade to
21.9, which came out since I installed.  I have that installed and will try 
to see if they changed
anything. Not much in the release notes to indicate any major changes.

-Ray Muno


On 9/29/21 10:54 AM, Jing Gong wrote:
 > Hi,
 >
 >
 > Before Nvidia persons look into the details, probably you can try to add the flag
 > "-fPIC" to the nvhpc compiler, like CC="nvc -fPIC", which at least worked for me.
 >
 >
 >
 > /Jing
 >
 >


 > *From:* users <users-boun...@lists.open-mpi.org> on behalf of Ray Muno via users
 > <users@lists.open-mpi.org>
 > *Sent:* Wednesday, September 29, 2021 17:22
 > *To:* Open MPI User's List
 > *Cc:* Ray Muno
 > *Subject:* Re: [OMPI users] OpenMPI 4.1.1, CentOS 7.9, nVidia HPC-SDk, 
build hints?
 > Thanks, I looked through previous emails here in the user list.  I guess 
I need to subscribe
to the
 > Developers list.
 >
 > -Ray Muno
 >
 > On 9/29/21 9:58 AM, Jeff Squyres (jsquyres) wrote:
 >> Ray --
 >>
 >> Looks like this is a dup of https://github.com/open-mpi/ompi/issues/8919.
 >>
 >>

 >
    -- 


   Ray Muno
   IT Systems Administrator
   e-mail: m...@umn.edu

   University of Minnesota
   Aerospace Engineering and Mechanics




--

 Ray Muno
 IT Systems Administrator
 e-mail:   m...@umn.edu
 Phone:   (612) 625-9531

 University of Minnesota
 Aerospace Engineering and Mechanics
 110 Union St. S.E.
 Minneapolis, MN 55455


Re: [OMPI users] OpenMPI 4.1.1, CentOS 7.9, nVidia HPC-SDk, build hints?

2021-09-29 Thread Ray Muno via users

In config.log, it appears to be passing the flag in the tests.

It also fails with version 21.9 of the HPC-SDK.

On 9/29/21 11:42 AM, Bennet Fauber via users wrote:

Ray,

If all the errors about not being compiled with -fPIC are still appearing, there may be a bug that 
is preventing the option from getting through to the compiler(s).  It might be worth looking through 
the logs to see the full compile command for one or more of them to see whether that is true?  Say, 
libs/comm_spawn_multiple_f08.o for example?


If -fPIC is missing, you may be able to recompile that manually with the -fPIC in place, then remake 
and see if that also causes the link error to go away, that would be a good start.


Hope this helps,    -- bennet



On Wed, Sep 29, 2021 at 12:29 PM Ray Muno via users <users@lists.open-mpi.org> wrote:


I did try that and it fails at the same place.

Which version of the nVidia HPC-SDK are you using? I am using 21.7.  I see 
there is an upgrade to
21.9, which came out since I installed.  I have that installed and will try 
to see if they changed
anything. Not much in the release notes to indicate any major changes.

-Ray Muno


On 9/29/21 10:54 AM, Jing Gong wrote:
 > Hi,
 >
 >
 > Before Nvidia persons look into the details, probably you can try to add the flag
 > "-fPIC" to the nvhpc compiler, like CC="nvc -fPIC", which at least worked for me.
 >
 >
 >
 > /Jing
 >
 >


 > *From:* users <users-boun...@lists.open-mpi.org> on behalf of Ray Muno via users
 > <users@lists.open-mpi.org>
 > *Sent:* Wednesday, September 29, 2021 17:22
 > *To:* Open MPI User's List
 > *Cc:* Ray Muno
 > *Subject:* Re: [OMPI users] OpenMPI 4.1.1, CentOS 7.9, nVidia HPC-SDk, 
build hints?
 > Thanks, I looked through previous emails here in the user list.  I guess 
I need to subscribe
to the
 > Developers list.
 >
 > -Ray Muno
 >
 > On 9/29/21 9:58 AM, Jeff Squyres (jsquyres) wrote:
 >> Ray --
 >>
 >> Looks like this is a dup of https://github.com/open-mpi/ompi/issues/8919.
 >>
 >>

 >
    -- 


   Ray Muno
   IT Systems Administrator
   e-mail: m...@umn.edu

   University of Minnesota
   Aerospace Engineering and Mechanics




--

 Ray Muno
 IT Systems Administrator
 e-mail:   m...@umn.edu
 Phone:   (612) 625-9531

 University of Minnesota
 Aerospace Engineering and Mechanics
 110 Union St. S.E.
 Minneapolis, MN 55455


Re: [OMPI users] OpenMPI 4.1.1, CentOS 7.9, nVidia HPC-SDk, build hints?

2021-09-29 Thread Ray Muno via users

I did try that and it fails at the same place.

Which version of the nVidia HPC-SDK are you using? I am using 21.7.  I see there is an upgrade to 
21.9, which came out since I installed.  I have that installed and will try to see if they changed 
anything. Not much in the release notes to indicate any major changes.


-Ray Muno


On 9/29/21 10:54 AM, Jing Gong wrote:

Hi,


Before Nvidia persons look into the details, probably you can try to add the flag "-fPIC" to the 
nvhpc compiler, like CC="nvc -fPIC", which at least worked for me.




/Jing


*From:* users  on behalf of Ray Muno via users 


*Sent:* Wednesday, September 29, 2021 17:22
*To:* Open MPI User's List
*Cc:* Ray Muno
*Subject:* Re: [OMPI users] OpenMPI 4.1.1, CentOS 7.9, nVidia HPC-SDk, build 
hints?
Thanks, I looked through previous emails here in the user list.  I guess I need 
to subscribe to the
Developers list.

-Ray Muno

On 9/29/21 9:58 AM, Jeff Squyres (jsquyres) wrote:

Ray --

Looks like this is a dup of https://github.com/open-mpi/ompi/issues/8919.







--

 Ray Muno
 IT Systems Administrator
 e-mail:   m...@umn.edu

 University of Minnesota
 Aerospace Engineering and Mechanics



Re: [OMPI users] OpenMPI 4.1.1, CentOS 7.9, nVidia HPC-SDk, build hints?

2021-09-29 Thread Ray Muno via users

On 9/29/21 9:58 AM, Jeff Squyres (jsquyres) wrote:

Ray --

Looks like this is a dup of https://github.com/open-mpi/ompi/issues/8919.





Looking at the page on the OpenMPI Developer list, it says

"A known workaround is to add '-fPIC' to the CFLAGS, CXXFLAGS, FCFLAGS (maybe not need to all of 
those)."


Tried adding these, still fails at the same place.
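
(For reference, a flags-only attempt along those lines looks roughly like this; it is a sketch, not
the exact command I ran:)

   configure CC=nvc CXX=nvc++ FC=nvfortran CFLAGS=-fPIC CXXFLAGS=-fPIC FCFLAGS=-fPIC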



--

 Ray Muno
 IT Systems Administrator
 e-mail:   m...@umn.edu

 University of Minnesota
 Aerospace Engineering and Mechanics



Re: [OMPI users] OpenMPI 4.1.1, CentOS 7.9, nVidia HPC-SDk, build hints?

2021-09-29 Thread Ray Muno via users
Thanks, I looked through previous emails here in the user list.  I guess I need to subscribe to the 
Developers list.


-Ray Muno

On 9/29/21 9:58 AM, Jeff Squyres (jsquyres) wrote:

Ray --

Looks like this is a dup of https://github.com/open-mpi/ompi/issues/8919.




On Sep 29, 2021, at 10:47 AM, Ray Muno via users <users@lists.open-mpi.org> wrote:

Looking to compile

OpenMPI 4.1.1

under Centos 7.9, (with Mellanox OFED 5.3 stack)

with the nVidia HPC-SDK, version 21.7.

Configure works, build fails.

The nvidia HPC-SDK is using their environmental module which sets these 
variables, etc.


setenvNVHPC /stage/opt/NV_hpc_sdk
setenvCC /stage/opt/NV_hpc_sdk/Linux_x86_64/21.7/compilers/bin/nvc
setenvCXX /stage/opt/NV_hpc_sdk/Linux_x86_64/21.7/compilers/bin/nvc++
setenvFC /stage/opt/NV_hpc_sdk/Linux_x86_64/21.7/compilers/bin/nvfortran
setenvF90 /stage/opt/NV_hpc_sdk/Linux_x86_64/21.7/compilers/bin/nvfortran
setenvF77 /stage/opt/NV_hpc_sdk/Linux_x86_64/21.7/compilers/bin/nvfortran
setenvCPP cpp
prepend-pathPATH /stage/opt/NV_hpc_sdk/Linux_x86_64/21.7/cuda/bin
prepend-pathPATH /stage/opt/NV_hpc_sdk/Linux_x86_64/21.7/compilers/bin
prepend-pathPATH /stage/opt/NV_hpc_sdk/Linux_x86_64/21.7/compilers/extras/qd/bin
prepend-pathLD_LIBRARY_PATH /stage/opt/NV_hpc_sdk/Linux_x86_64/21.7/cuda/lib64
prepend-pathLD_LIBRARY_PATH 
/stage/opt/NV_hpc_sdk/Linux_x86_64/21.7/cuda/extras/CUPTI/lib64
prepend-pathLD_LIBRARY_PATH 
/stage/opt/NV_hpc_sdk/Linux_x86_64/21.7/compilers/extras/qd/lib
prepend-pathLD_LIBRARY_PATH 
/stage/opt/NV_hpc_sdk/Linux_x86_64/21.7/compilers/lib
prepend-pathLD_LIBRARY_PATH 
/stage/opt/NV_hpc_sdk/Linux_x86_64/21.7/math_libs/lib64
prepend-pathLD_LIBRARY_PATH 
/stage/opt/NV_hpc_sdk/Linux_x86_64/21.7/comm_libs/nccl/lib
prepend-pathLD_LIBRARY_PATH 
/stage/opt/NV_hpc_sdk/Linux_x86_64/21.7/comm_libs/nvshmem/lib
prepend-pathCPATH /stage/opt/NV_hpc_sdk/Linux_x86_64/21.7/math_libs/include
prepend-pathCPATH /stage/opt/NV_hpc_sdk/Linux_x86_64/21.7/comm_libs/nccl/include
prepend-pathCPATH 
/stage/opt/NV_hpc_sdk/Linux_x86_64/21.7/comm_libs/nvshmem/include
prepend-pathCPATH 
/stage/opt/NV_hpc_sdk/Linux_x86_64/21.7/compilers/extras/qd/include/qd
prepend-pathMANPATH /stage/opt/NV_hpc_sdk/Linux_x86_64/21.7/compilers/man
---


configure  --prefix=/stage/opt/OpenMPI/ROME/4.1.1/NV-HPC/21.7 CC=nvc CXX=nvc++ 
FC=nvfortran

...
FCLD libmpi_usempif08.la
/usr/bin/ld: .libs/comm_spawn_multiple_f08.o: relocation R_X86_64_32S against `.rodata' can not be 
used when making a shared object; recompile with -fPIC
/usr/bin/ld: .libs/startall_f08.o: relocation R_X86_64_32S against `.rodata' can not be used when 
making a shared object; recompile with -fPIC
/usr/bin/ld: .libs/testall_f08.o: relocation R_X86_64_32S against `.rodata' can not be used when 
making a shared object; recompile with -fPIC
/usr/bin/ld: .libs/testany_f08.o: relocation R_X86_64_32S against `.rodata' can not be used when 
making a shared object; recompile with -fPIC
/usr/bin/ld: .libs/testsome_f08.o: relocation R_X86_64_32S against `.rodata' can not be used when 
making a shared object; recompile with -fPIC
/usr/bin/ld: .libs/type_create_struct_f08.o: relocation R_X86_64_32S against `.rodata' can not be 
used when making a shared object; recompile with -fPIC
/usr/bin/ld: .libs/type_get_contents_f08.o: relocation R_X86_64_32S against `.rodata' can not be 
used when making a shared object; recompile with -fPIC
/usr/bin/ld: .libs/waitall_f08.o: relocation R_X86_64_32S against `.rodata' can not be used when 
making a shared object; recompile with -fPIC
/usr/bin/ld: .libs/waitany_f08.o: relocation R_X86_64_32S against `.rodata' can not be used when 
making a shared object; recompile with -fPIC
/usr/bin/ld: .libs/waitsome_f08.o: relocation R_X86_64_32S against `.rodata' can not be used when 
making a shared object; recompile with -fPIC
/usr/bin/ld: profile/.libs/pcomm_spawn_multiple_f08.o: relocation R_X86_64_32S against `.rodata' 
can not be used when making a shared object; recompile with -fPIC
/usr/bin/ld: profile/.libs/pstartall_f08.o: relocation R_X86_64_32S against `.rodata' can not be 
used when making a shared object; recompile with -fPIC
/usr/bin/ld: profile/.libs/ptestall_f08.o: relocation R_X86_64_32S against `.rodata' can not be 
used when making a shared object; recompile with -fPIC
/usr/bin/ld: profile/.libs/ptestany_f08.o: relocation R_X86_64_32S against `.rodata' can not be 
used when making a shared object; recompile with -fPIC
/usr/bin/ld: profile/.libs/ptestsome_f08.o: relocation R_X86_64_32S against `.rodata' can not be 
used when making a shared object; recompile with -fPIC
/usr/bin/ld: profile/.libs/ptype_create_struct_f08.o: relocation R_X86_64_32S against `.rodata' 
can not be used when making a shared object; recompile with -fPIC
/usr/bin/ld: profile/.libs/ptype_get_contents_f08.o: reloc

[OMPI users] OpenMPI 4.1.1, CentOS 7.9, nVidia HPC-SDk, build hints?

2021-09-29 Thread Ray Muno via users
/pwaitany_f08.o: relocation R_X86_64_32S against `.rodata' can not be used 
when making a shared object; recompile with -fPIC
/usr/bin/ld: profile/.libs/pwaitsome_f08.o: relocation R_X86_64_32S against `.rodata' can not be 
used when making a shared object; recompile with -fPIC
/usr/bin/ld: .libs/abort_f08.o: relocation R_X86_64_PC32 against symbol `ompi_abort_f' can not be 
used when making a shared object; recompile with -fPIC

/usr/bin/ld: final link failed: Bad value
make[2]: *** [libmpi_usempif08.la] Error 2
make[2]: Leaving directory 
`/project/muno/OpenMPI/BUILD/4.1.1/ROME/NV-HPC/21.7/ompi/mpi/fortran/use-mpi-f08'

make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory 
`/project/muno/OpenMPI/BUILD/4.1.1/ROME/NV-HPC/21.7/ompi'
make: *** [all-recursive] Error 1






--

 Ray Muno
 IT Systems Administrator
 e-mail:   m...@umn.edu
 University of Minnesota
 Aerospace Engineering and Mechanics


Re: [OMPI users] openmpi/pmix/ucx

2020-02-07 Thread Ray Muno via users

We're using MLNX_OFED 4.7.3. It supplies UCX 1.7.0.

We have OpenMPI 4.0.2 compiled against the Mellanox OFED 4.7.3-provided versions of UCX, KNEM and 
HCOLL, along with HWLOC 2.1.0 from the OpenMPI site.


I mirrored the build to be what Mellanox used to configure OpenMPI in HPC-X 2.5.

I have users using GCC, PGI, Intel and AOCC compilers with this config.  PGI was the only one that 
was a challenge to build due to conflicts with HCOLL.


-Ray Muno

On 2/7/20 10:04 AM, Michael Di Domenico via users wrote:

i haven't compiled openmpi in a while, but i'm in the process of
upgrading our cluster.

the last time i did this there were specific versions of mpi/pmix/ucx
that were all tested and supposed to work together.  my understanding
of this was because pmi/ucx was under rapid development and the api's
were changing

is that still an issue or can i take the latest stable branches from
git for each and have a relatively good shot at it all working
together?

the one semi-immovable i have right now is ucx which is at 1.7.0 as
installed by mellanox ofed.  if the above is true, is there a matrix
of versions i should be using for all the others?  nothing jumped out
at me on the openmpi website




--

 Ray Muno
 IT Manager
 e-mail:   m...@aem.umn.edu
 University of Minnesota
 Aerospace Engineering and Mechanics Mechanical Engineering



Re: [OMPI users] OpenMPI 4.0.2 with PGI 19.10, will not build with hcoll

2020-01-28 Thread Ray Muno via users

I opened a case with pgroup support regarding this.

We are also using Slurm along with HCOLL.

-Ray Muno

On 1/26/20 5:52 AM, Åke Sandgren via users wrote:

Note that when built against SLURM it will pick up pthread from
libslurm.la too.

On 1/26/20 4:37 AM, Gilles Gouaillardet via users wrote:

Thanks Jeff for the information and sharing the pointer.

FWIW, this issue typically occurs when libtool pulls the -pthread flag
from libhcoll.la that was compiled with a GNU compiler.
The simplest workaround is to remove libhcoll.la (so libtool simply
links with libhcoll.so and does not pull any compiler flags),
and the right fix is imho to either have the libtool maintainers
handle this case or the PGI/NVIDIA folks add support for the -pthread
flag.
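
For example, something like this (the lib directory is an assumption based on
--with-hcoll=/opt/mellanox/hcoll):

   # move the libtool archive out of the way so libtool links against libhcoll.so directly
   mv /opt/mellanox/hcoll/lib/libhcoll.la /opt/mellanox/hcoll/lib/libhcoll.la.bak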

Cheers,

Gilles

On Sun, Jan 26, 2020 at 12:09 PM Jeff Hammond via users
 wrote:


To be more strictly equivalent, you will want to add -D_REENTRANT to the 
substitution, but this may not affect hcoll.

https://stackoverflow.com/questions/2127797/significance-of-pthread-flag-when-compiling/2127819#2127819

The proper fix here is a change in OMPI build system, of course, to not set 
-pthread when PGI is used.

Jeff

On Fri, Jan 24, 2020 at 11:31 AM Åke Sandgren via users 
 wrote:


PGI needs this in, for instance, its siterc or localrc:
# replace unknown switch -pthread with -lpthread
switch -pthread is replace(-lpthread) positional(linker);
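
For instance, appended to the compiler's rc file (the PGI install path below is only an example):

   echo 'switch -pthread is replace(-lpthread) positional(linker);' >> /opt/pgi/linux86-64/19.10/bin/siterc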


On 1/24/20 8:12 PM, Raymond Muno via users wrote:

I am having issues building OpenMPI 4.0.2 using the PGI 19.10
compilers.  OS is CentOS 7.7, MLNX_OFED 4.7.3

It dies at:

PGC/x86-64 Linux 19.10-0: compilation completed with warnings
   CCLD mca_coll_hcoll.la
pgcc-Error-Unknown switch: -pthread
make[2]: *** [mca_coll_hcoll.la] Error 1
make[2]: Leaving directory
`/project/muno/OpenMPI/PGI/openmpi-4.0.2/ompi/mca/coll/hcoll'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/project/muno/OpenMPI/PGI/openmpi-4.0.2/ompi'
make: *** [all-recursive] Error 1

I tried with PGI 19.9 and had the same issue.

If I do not include hcoll, it builds.  I have successfully built OpenMPI
4.0.2 with GCC, Intel and AOCC compilers, all using the same options.

hcoll is provided by MLNX_OFED 4.7.3 and configure is run with

--with-hcoll=/opt/mellanox/hcoll




--
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90-580 14
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se


--
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/





--

 Ray Muno
 IT Manager
 University of Minnesota
 Aerospace Engineering and Mechanics Mechanical Engineering


Re: [OMPI users] OpenMPI 1.4.2 with Myrinet MX, mpirun seg faults

2010-10-28 Thread Ray Muno
On 10/28/2010 01:40 PM, Scott Atchley wrote:

> 
> Does your environment have LD_LIBRARY_PATH set to point to $OMPI/lib and 
> $MX/lib? Does it get set on login? Is $OMPI/bin in your PATH?
> 
> Scott

$MX/lib was not in LD_LIBRARY_PATH

That is interesting. On the head node,

 [/etc/ld.so.conf.d]$ more mx.conf
/opt/mx/lib

But that is not there on the compute nodes.  It must have been there
before the rebuild.

I was looking in /etc/ld.so.con* for things that were getting in my way
but not for things that were missing.

In any event, adding the $MX/lib to the relevant module takes care of
the issue.
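
(In modulefile terms it is just a one-line addition; this assumes a Tcl environment-modules file
for MX, like the ones shown elsewhere in this thread:)

   prepend-path    LD_LIBRARY_PATH    /opt/mx/lib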

Thank you...
-- 

 Ray Muno
 University of Minnesota


Re: [OMPI users] OpenMPI 1.4.2 with Myrinet MX, mpirun seg faults

2010-10-28 Thread Ray Muno
On 10/22/2010 07:36 AM, Scott Atchley wrote:
> Ray,
> 
> Looking back at your original message, you say that it works if you use the 
> Myricom supplied mpirun from the Myrinet roll. I wonder if this is a mismatch 
> between libraries on the compute nodes.
> 
> What do you get if you use your OMPI's mpirun with:
> 
> $ mpirun -n 1 -H  ldd $PWD/
> 
> I am wondering if ldd find the libraries from your compile or the Myrinet 
> roll.
> 

OK, a bit of a hiatus trying to get this resolved.  Had to tend other
fires...

I do think I had an issue of mixed environments.   It is a Rocks 5.3
test cluster and it had an old version of OpenMPI installed as part of
the Rocks 5.3 HPC roll.  I have now removed the HPC roll. All nodes were
rebuilt.

In the previous setup, we could actually run OpenMPI jobs over MX.

With all other spurious versions of OpenMPI (and MPICH for that matter)
removed, I have rebuilt and re-installed, from a fresh source tree,
OpenMPI 1.4.3. It was built with PGI 10.8 compilers.

Now, we cannot run with MX at all.

The install was built with MX.

$ ompi_info | grep mx
 MCA btl: mx (MCA v2.0, API v2.0, Component v1.4.3)
 MCA mtl: mx (MCA v2.0, API v2.0, Component v1.4.3)

I can run with TCP, but now I get

[compute-0-1.local:24863] mca: base: component_find: unable to open
/share/apps/opt/OpenMPI/1.4.3/PGI/10.8/lib/openmpi/mca_mtl_mx: perhaps a
missing symbol, or compiled for a different version of Open MPI? (ignored)

$ ls -l /share/apps/opt/OpenMPI/1.4.3/PGI/10.8/lib/openmpi/mca_mtl_mx*
-rwxr-xr-x 1 muno muno  1070 Oct 28 12:49
/share/apps/opt/OpenMPI/1.4.3/PGI/10.8/lib/openmpi/mca_mtl_mx.la
-rwxr-xr-x 1 muno muno 32044 Oct 28 12:49
/share/apps/opt/OpenMPI/1.4.3/PGI/10.8/lib/openmpi/mca_mtl_mx.so

mpirun -v -nolocal -np 96 --x MX_RCACHE=2 -hostfile machines  --mca mtl
mx --mca pml cm cpi.pgi
[compute-0-3.local:21116] mca: base: component_find: unable to open
/share/apps/opt/OpenMPI/1.4.3/PGI/10.8/lib/openmpi/mca_mtl_mx: perhaps a
missing symbol, or compiled for a different version of Open MPI? (ignored)
[compute-0-3.local:21115] mca: base: component_find: unable to open
/share/apps/opt/OpenMPI/1.4.3/PGI/10.8/lib/openmpi/mca_mtl_mx: perhaps a
missing symbol, or compiled for a different version of Open MPI? (ignored)
--
A requested component was not found, or was unable to be opened.  This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded).  Note that
Open MPI stopped checking at the first component that it did not find.

Host:  compute-0-3.local
Framework: mtl
Component: mx
--
[compute-0-3.local:21116] mca: base: components_open: component pml / cm
open function failed
--
A requested component was not found, or was unable to be opened.  This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded).  Note that
Open MPI stopped checking at the first component that it did not find.

Host:  compute-0-3.local
Framework: mtl
Component: mx
--
[compute-0-3.local:21115] mca: base: components_open: component pml / cm
open function failed
[compute-0-3.local:21117] mca: base: component_find: unable to open
/share/apps/opt/OpenMPI/1.4.3/PGI/10.8/lib/openmpi/mca_mtl_mx: perhaps a
missing symbol, or compiled for a different version of Open MPI? (ignored)
----------



--
 Ray Muno
 University of Minnesota


Re: [OMPI users] OpenMPI and SGE

2009-06-25 Thread Ray Muno
As a follow up, the problem was with host name resolution. The error was
introduced, with a change to the Rocks environment, which broke reverse
lookups for host names.
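
(A quick sanity check on a node, for anyone who hits this later: the forward and reverse lookups
should agree. The host name below is just an example.)

   ip=$(getent hosts compute-1-1.local | awk '{print $1}')
   getent hosts "$ip"    # reverse lookup; should come back with the same host name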


-- 

 Ray Muno


Re: [OMPI users] OpenMPI and SGE

2009-06-23 Thread Ray Muno
Rolf Vandevaart wrote:

>>
>> PMGR_COLLECTIVE ERROR: unitialized MPI task: Missing required
>> environment variable: MPIRUN_RANK
>> PMGR_COLLECTIVE ERROR: PMGR_COLLECTIVE ERROR: unitialized MPI task:
>> Missing required environment variable: MPIRUN_RANK
>>   
> I do not recognize these errors as part of Open MPI.  A google search
> showed they might be coming from MVAPICH.  Is there a chance we are
> using Open MPI to launch the jobs (via Open MPI mpirun) but we are
> actually launching an application that is linked to MVAPICH?
> 
>
You are correct. I was trying to run the MVAPICH compiled test program.

With an OpenMPI compiled test, I do get an extra line of output with the
verbose flag. The program just hangs at that point.

[muno@compute-6-30 ~]$ which mpirun
/share/apps/opt/openmpi_pgi/bin/mpirun


[muno@compute-6-30 ~]$ldd a.out
libmpi_f90.so.0 =>
/share/apps/opt/openmpi_pgi/lib/libmpi_f90.so.0 (0x2aaad000)
libmpi_f77.so.0 =>
/share/apps/opt/openmpi_pgi/lib/libmpi_f77.so.0 (0x2acb)
libmpi.so.0 => /share/apps/opt/openmpi_pgi/lib/libmpi.so.0
(0x2aee)
...


 mpirun -np $NSLOTS -mca pls_gridengine_verbose 1 a.out
Starting server daemon at host "compute-6-25.local"
Starting server daemon at host "compute-1-1.local"
Server daemon successfully started with task id "1.compute-6-25"
error: commlib error: access denied (client IP resolved to host name "".
This is not identical to clients host name "")
error: executing task of job 12144 failed: failed sending task to
execd@compute-1-1.local: can't find connection
[compute-6-25.local:10810] ERROR: A daemon on node compute-1-1.local
failed to start as expected.
[compute-6-25.local:10810] ERROR: There may be more information
available from
[compute-6-25.local:10810] ERROR: the 'qstat -t' command on the Grid
Engine tasks.
[compute-6-25.local:10810] ERROR: If the problem persists, please
restart the
[compute-6-25.local:10810] ERROR: Grid Engine PE job
[compute-6-25.local:10810] ERROR: The daemon exited unexpectedly with
status 1.
Establishing /usr/bin/ssh session to host compute-6-25.local ...



-- 

 Ray Muno


Re: [OMPI users] OpenMPI and SGE

2009-06-23 Thread Ray Muno
Ray Muno wrote:

> Tha give me

How about "That gives me"

> 
> PMGR_COLLECTIVE ERROR: unitialized MPI task: Missing required
> environment variable: MPIRUN_RANK
> PMGR_COLLECTIVE ERROR: PMGR_COLLECTIVE ERROR: unitialized MPI task:
> Missing required environment variable: MPIRUN_RANK
> 
> 


-- 

 Ray Muno


Re: [OMPI users] OpenMPI and SGE

2009-06-23 Thread Ray Muno
Rolf Vandevaart wrote:
> Ray Muno wrote:
>> Ray Muno wrote:
>>   
>>> We are running a cluster using Rocks 5.0 and OpenMPI 1.2 (primarily).
>>> Scheduling is done through SGE.  MPI communication is over InfiniBand.
>>>
>>> 
>>
>> We also have OpenMPI 1.3 installed and receive similar errors.-
>>
>>   
> This does sound like a problem with SGE.  By default, we use qrsh to
> start the jobs on all the remote nodes.  I believe that is the command
> that is failing.  There are two things you can try to get more info
> depending on the version of Open MPI.   With version 1.2, you can try
> this to get more information.
> 
> --mca pls_gridengine_verbose 1
> 
This did not look like it gave me any more info.

> With Open MPI 1.3.2 and later the verbose flag will not help.  But
> instead, you can disable the use of qrsh and instead use rsh/ssh to
> start the remote jobs.
> 
> --mca plm_rsh_disable_qrsh 1
> 

Tha give me

PMGR_COLLECTIVE ERROR: unitialized MPI task: Missing required
environment variable: MPIRUN_RANK
PMGR_COLLECTIVE ERROR: PMGR_COLLECTIVE ERROR: unitialized MPI task:
Missing required environment variable: MPIRUN_RANK


-- 

 Ray Muno
 University of Minnesota


[OMPI users] OpenMPI and SGE

2009-06-23 Thread Ray Muno
We are running a cluster using Rocks 5.0 and OpenMPI 1.2 (primarily).
Scheduling is done through SGE.  MPI communication is over InfiniBand.

We have been running with this setup for over 9 months.  Last week, all
user jobs stopped executing (cluster load dropped to zero).  User can
schedule jobs but when they try to execute, they get errors of the form:

-
[compute-2-5.local:12321] ERROR: The daemon exited unexpectedly with
status 1.
error: commlib error: access denied (client IP resolved to host name "".
This is not identical to clients host name "")
error: executing task of job 11901 failed: failed sending task to
execd@compute-5-9.local: can't find connection
[compute-2-5.local:12321] ERROR: A daemon on node compute-5-9.local
failed to start as expected.
[compute-2-5.local:12321] ERROR: There may be more information available
from
[compute-2-5.local:12321] ERROR: the 'qstat -t' command on the Grid
Engine tasks.
[compute-2-5.local:12321] ERROR: If the problem persists, please restart
the
[compute-2-5.local:12321] ERROR: Grid Engine PE job
[compute-2-5.local:12321] ERROR: The daemon exited unexpectedly with
status 1.
error: commlib error: access denied (client IP resolved to host name "".
This is not identical to clients host name "")


When run interactively, we see.

-
error: commlib error: access denied (client IP resolved to host name "".
This is not identical to clients host name "")
error: executing task of job 12094 failed: failed sending task to
execd@compute-4-11.local: can't find connection
--
A daemon (pid 4938) died unexpectedly with status 1 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--
mpirun: clean termination accomplished
-

This seems to be an error with SGE but it is only affecting OpenMPI.
User can successfully launch and run jobs with MVAPICH.

Some changes were made to the ROCKS setup that may have caused this but
I have not found where the actual problems lies.

-- 

 Ray Muno
 University of Minnesota


[OMPI users] OpenMPI 1.3RC2 job startup issue

2008-12-22 Thread Ray Muno
We have been happily running under OpenMPI 1.2 on our cluster until 
recently.  It is 2200 processors (8-way Opteron), Qlogic IB connected.


We have had issues starting larger jobs (600+ processors).  There seemed 
to be some indication that OpenMPI may solve our problems.


It built with no problem and installed. Users can compile programs.

When they tried to run, they got the attached output.  Are we missing 
something obvious?


This is a Rocks cluster with jobs scheduled through SGE.

=
$ mpirun -np 1024 program

[compute-2-6.local:32580] Error: unknown option "--daemonize"
Usage: orted [OPTION]...
Start an Open RTE Daemon

   --bootproxy Run as boot proxy for 
-d|--debug   Debug the OpenRTE
-d|--spinHave the orted spin until we can connect a debugger
 to it
   --debug-daemons   Enable debugging of OpenRTE daemons
   --debug-daemons-file  Enable debugging of OpenRTE daemons, storing 
output

 in files
   --gprreplicaRegistry contact information.
-h|--helpThis help message
   --mpi-call-yield 
 Have MPI (or similar) applications call yield when
 idle
   --name  Set the orte process name
   --no-daemonizeDon't daemonize into the background
   --nodename  Node name as specified by host/resource
 description.
   --ns-ndsset sds/nds component to use for daemon (normally
 not needed)
   --nsreplica Name service contact information.
   --num_procs Set the number of process in this job
   --persistent  Remain alive after the application process
 completes
   --report-uriReport this process' uri on indicated pipe
   --scope Set restrictions on who can connect to this
 universe
   --seedHost replicas for the core universe services
   --set-sid Direct the orted to separate from the current
 session
   --tmpdirSet the root for the session directory tree
   --universe  Set the universe name as
 username@hostname:universe_name for this
 application
   --vpid_startSet the starting vpid for this job
--
A daemon (pid 4151) died unexpectedly with status 251 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--
--
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--
--
mpirun was unable to cleanly terminate the daemons on the nodes shown
below. Additional manual cleanup may be required - please refer to
the "orte-clean" tool for assistance.
--
compute-5-15.local - daemon did not report back when launched
compute-5-35.local - daemon did not report back when launched
compute-4-8.local - daemon did not report back when launched
compute-7-2.local - daemon did not report back when launched
compute-2-6.local - daemon did not report back when launched
compute-6-28.local - daemon did not report back when launched
compute-6-35.local - daemon did not report back when launched
compute-6-25.local
compute-6-26.local
compute-2-19.local - daemon did not report back when launched
compute-6-37.local - daemon did not report back when launched
compute-6-12.local - daemon did not report back when launched
compute-2-36.local - daemon did not report back when launched
compute-7-5.local - daemon did not report back when launched
compute-7-23.local - daemon did not report back when launched

========

--

 Ray Muno
 University of Minnesota


Re: [OMPI users] /dev/shm

2008-11-20 Thread Ray Muno

John Hearns wrote:

2008/11/19 Ray Muno <m...@aem.umn.edu>


Thought I would revisit this one.

We are still having issues with this. It is not clear to me what is leaving
the user files behind in /dev/shm.

This is not something users are doing directly, they are just compiling
their code directly with mpif90 (from OpenMPI), using various compilers.
Compilers in use are PGI, Intel, SunStudio and Pathscale.

It looks like every job run leaves something behind in /dev/shm and it
slowly fills up.   We are just clearing these out at this point.



Could you not run ipcs with the -p flag every few minutes and try to figure
out what the processes are which are leaving these?
(by that I mean catch them when they are live, and the creating process is
still on the system)



OK, what should I be seeing when I run "ipcs -p"?

I have a set of nodes that has a user job running on it. It wrote a file 
in /dev/shm when it started.  The job is still running.


I see

# ipcs -p

-- Shared Memory Creator/Last-op 
shmid  owner  cpid   lpid


-- Message Queues PIDs 
msqid  owner  lspid  lrpid




--

 Ray Muno


Re: [OMPI users] /dev/shm

2008-11-19 Thread Ray Muno

Ralph Castain wrote:

Hi Ray

Are the jobs that leave files behind terminating normally or aborting? 
Are there any warnings/error messages out of mpirun?


Just trying to determine if this is an abnormal termination issue or a 
bug in OMPI itself.


Ralph



As far as I know, they are from jobs that are terminating normally. I 
have had no notice from users of errors. We are still trying to get a 
handle on this.


With 30 users and 280+ nodes, it is something we have not tracked down 
completely. We are just seeing the after effects of the stale files 
getting left behind. At some point, new jobs do not launch.



--

 Ray Muno


Re: [OMPI users] /dev/shm

2008-11-19 Thread Ray Muno

Thought I would revisit this one.

We are still having issues with this. It is not clear to me what is 
leaving the user files behind in /dev/shm.


This is not something users are doing directly, they are just compiling 
their code directly with mpif90 (from OpenMPI), using various compilers. 
Compilers in use are PGI, Intel, SunStudio and Pathscale.


It looks like every job run leaves something behind in /dev/shm and it 
slowly fills up.   We are just clearing these out at this point.
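
(The clearing is just a cron job along these lines; the one-day age cutoff is an arbitrary
assumption, and the /tmp pattern follows the openmpi-sessions naming Jeff mentions below:)

   # remove Open MPI scratch older than a day on each node
   find /dev/shm -maxdepth 1 -type f -mtime +1 -delete
   find /tmp -maxdepth 1 -type d -name 'openmpi-sessions-*' -mtime +1 -exec rm -rf {} +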



Jeff Squyres wrote:
That is odd.  Is your user's app crashing or being forcibly killed?  The 
ORTE daemon that is silently launched in v1.2 jobs should ensure that 
files under /tmp/openmpi-sessions-@ are removed.



On Nov 10, 2008, at 2:14 PM, Ray Muno wrote:


Brock Palen wrote:
on most systems /dev/shm is limited to half the physical ram.  Was 
the user someone filling up /dev/shm so there was no space?


The problem is there is a large collection of stale files left in 
there by the users that have run on that node (Rocks based cluster).


I am trying to determine why they are left behind.




--

 Ray Muno
 University of Minnesota
 Aerospace Engineering and Mechanics


Re: [OMPI users] Trouble with OpenMPI and Intel 10.1 compilers

2008-11-11 Thread Ray Muno

Jeff Squyres wrote:

On Nov 11, 2008, at 2:54 PM, Ray Muno wrote:

See 
http://www.open-mpi.org/faq/?category=mpi-apps#override-wrappers-after-v1.0. 



OK, that tells me lots of things ;-)

Should I be running configure with --with-wrapper-cflags,
--with-wrapper-fflags etc,

set to include

-i_dynamic


If you want to, sure.  As a guideline, the wrapper compilers put in the 
minimum number of flags to get the job done -- any other flags are a 
local policy decision (and we wouldn't presume to make those for you).


FWIW, the warning is harmless, but I can see how you wouldn't want users 
to see it / be alarmed by it.




If we have them compile with -i_dynamic, are there any implications?

--

 Ray Muno


Re: [OMPI users] Trouble with OpenMPI and Intel 10.1 compilers

2008-11-11 Thread Ray Muno

Jeff Squyres wrote:
See 
http://www.open-mpi.org/faq/?category=mpi-apps#override-wrappers-after-v1.0. 




OK, that tells me lots of things ;-)

Should I be running configure with --with-wrapper-cflags,
--with-wrapper-fflags etc,

set to include

-i_dynamic
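
Something like this, I assume (the -cxxflags/-fcflags variants are my guess at the matching
option names):

   ./configure CC=icc CXX=icpc F77=ifort FC=ifort \
       --with-wrapper-cflags=-i_dynamic --with-wrapper-cxxflags=-i_dynamic \
       --with-wrapper-fflags=-i_dynamic --with-wrapper-fcflags=-i_dynamic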
--

 Ray Muno


Re: [OMPI users] Trouble with OpenMPI and Intel 10.1 compilers

2008-11-11 Thread Ray Muno

Steve Jones wrote:



Are you adding -i_dynamic to base flags, or something different?

Steve


I brought this up to see if something should be changed with the install.

For now, I am leaving that to users.

--

 Ray Muno


Re: [OMPI users] Trouble with OpenMPI and Intel 10.1 compilers

2008-11-11 Thread Ray Muno

Gus Correa wrote:

Hi Ray and list

I have Intel ifort 10.1.017 on a Rocks 4.3 cluster.
The OpenMPI compiler wrappers (i.e. "opal_wrapper") work fine,
and find the shared libraries (Intel or other) without a problem.

My guess is that this is not an OpenMPI problem, but an Intel compiler 
environment glitch.
I wonder if your .profile/.tcshrc/.bashrc files initialize the Intel 
compiler environment properly.
I.e., "source /share/apps/intel/fce/10.1.018/bin/ifortvars.csh" or 
similar, to get the right

Intel environment variables inserted on
PATH, LD_LIBRARY_PATH, MANPATH, and INTEL_LICENSE_FILE.

Not doing this caused trouble for me in the past.
Double or inconsistent assignment of LD_LIBRARY_PATH and PATH
(say on the ifortvars.csh and on the user login files) also caused 
conflicts.


I am not sure if this needs to be done before you configure and install 
OpenMPI,

but doing it after you build OpenMPI may still be OK.

I hope this helps,
Gus Correa



That does help. I confirmed that what I added needs to be in the 
environment (LD_LIBRARY_PATH).  Must have missed that in the docs. I 
have now added the appropriate variables to our modules environment.
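
(Concretely, the module addition is a line along these lines, with the lib path taken from where
libimf.so lives in our install:)

   prepend-path    LD_LIBRARY_PATH    /share/apps/Intel/fce/10.1.018/lib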


Seems strange that OpenMPI built without these being set at all. I could 
also compile test codes with the compilers, just not with mpicc and mpif90.



-Ray Muno


Re: [OMPI users] Trouble with OpenMPI and Intel 10.1 compilers

2008-11-11 Thread Ray Muno

Ray Muno wrote:

I updated the LD_LIBRARY_PATH to point to the directories that contain 
the installed copies of libimf.so. (this is not something I have not had 
to do for other compiler/OpenMpi combinations)




How about...

(this is not something I have had to do for other compiler/OpenMpi 
combinations)


-Ray Muno


[OMPI users] Trouble with OpenMPI and Intel 10.1 compilers

2008-11-11 Thread Ray Muno

We have recently installed the Intel 10.1 compiler suite on our cluster.

I built OpenMPI (1.2.7 and 1.2.8) with

./configure CC=icc CXX=icpc F77=ifort FC=ifort

It configures, builds and installs.

However, the MPI compiler drivers (mpicc, mpif90, etc) fail immediately 
with error of the sort


mpif90: error while loading shared libraries: libimf.so: cannot open 
shared object file: No such file or directory


I updated the LD_LIBRARY_PATH to point to the directories that contain 
the installed copies of libimf.so. (this is not something I have not had 
to do for other compiler/OpenMpi combinations)


At that point, the program will compile but I get warnings like:

[muno@titan ~]$ mpif90 test.f

/share/apps/Intel/fce/10.1.018/lib/libimf.so: warning: warning: 
feupdateenv is not implemented and will always fail


In a google search, I found a reference to this in the OpenMPI lists. 
When I follow the link, it is a different thread. Searching the OpenMPI 
lists from the web page does not find any matches. Strange.


I found some references to this at some other sites using OpenMPI on 
clusters and they said to use


-i_dynamic

on the compile line.

This removes the warning.

Is there something I should be doing at OpenMPI configure time to take 
care of these issues?


--
Ray Muno
University of Minnesota
Aerospace Engineering


Re: [OMPI users] /dev/shm

2008-11-10 Thread Ray Muno

Jeff Squyres wrote:
That is odd.  Is your user's app crashing or being forcibly killed?  The 
ORTE daemon that is silently launched in v1.2 jobs should ensure that 
files under /tmp/openmpi-sessions-@ are removed.





It looks like I see orphaned directories under /tmp/openmpi* as well.


--

 Ray Muno


Re: [OMPI users] /dev/shm

2008-11-10 Thread Ray Muno

Brock Palen wrote:
on most systems /dev/shm is limited to half the physical ram.  Was the 
user someone filling up /dev/shm so there was no space?




The problem is there is a large collection of stale files left in there 
by the users that have run on that node (Rocks based cluster).


I am trying to determine why they are left behind.
--

 Ray Muno
 University of Minnesota
 Aerospace Engineering and Mechanics


[OMPI users] /dev/shm

2008-11-10 Thread Ray Muno
We are running OpenMPI 1.2.7.  Now that we have been running for a 
while, we are getting messages of the sort.


node: Unable to allocate shared memory for intra-node messaging.
node: Delete stale shared memory files in /dev/shm.
MPI process terminated unexpectedly

If the user deletes the stale files, they can run.


--

 Ray Muno
 University of Minnesota
 Aerospace Engineering and Mechanics



Re: [OMPI users] Performance: MPICH2 vs OpenMPI

2008-10-08 Thread Ray Muno
I would be interested in what others have to say about this as well.

We have been doing a bit of performance testing since we are deploying a
new cluster and it is our first InfiniBand-based setup.

In our experience, so far, OpenMPI is coming out faster than MVAPICH.
Comparisons were made with different compilers, PGI and Pathscale. We do
not have a running implementation of OpenMPI with SunStudio compilers.

Our tests were with actual user codes running on up to 600 processors so
far.


Sangamesh B wrote:
> Hi All,
> 
>I wanted to switch from mpich2/mvapich2 to OpenMPI, as OpenMPI
> supports both ethernet and infiniband. Before doing that I tested an
> application 'GROMACS' to compare the performance of MPICH2 & OpenMPI. Both
> have been compiled with GNU compilers.
> 
> After this benchmark, I came to know that OpenMPI is slower than MPICH2.
> 
> This benchmark is run on an AMD dual core, dual Opteron processor. Both have
> been compiled with default configurations.
> 
> The job is run on 2 nodes - 8 cores.
> 
> OpenMPI - 25 m 39 s.
> MPICH2  -  15 m 53 s.
> 
> Any comments ..?
> 
> Thanks,
> Sangamesh
> 

-Ray Muno
 Aerospace Engineering.