Re: [OMPI users] OpenMPI 4.1.1, CentOS 7.9, nVidia HPC-SDK, build hints?
/NV_hpc_sdk/Linux_x86_64/21.7/compilers/bin/nvc
  C compiler family name: PGI
  C compiler version: 21.7-0
  C++ compiler: nvc++
  C++ compiler absolute: /stage/opt/NV_hpc_sdk/Linux_x86_64/21.7/compilers/bin/nvc++
  Fort compiler: nvfortran
  Fort compiler abs: /stage/opt/NV_hpc_sdk/Linux_x86_64/21.7/compilers/bin/nvfortran

NV-HPC Version 21.9

+++ ompi_info
  Package: Open MPI muno@loki.local Distribution
  Open MPI: 4.1.1
  Open MPI repo revision: v4.1.1
  Open MPI release date: Apr 24, 2021
  Open RTE: 4.1.1
  Open RTE repo revision: v4.1.1
  Open RTE release date: Apr 24, 2021
  OPAL: 4.1.1
  OPAL repo revision: v4.1.1
  OPAL release date: Apr 24, 2021
  MPI API: 3.1.0
  Ident string: 4.1.1
  Prefix: /stage/opt/OpenMPI/ROME/4.1.1/NV-HPC/21.9
  Configured architecture: x86_64-pc-linux-gnu
  Configure host: loki.local
  Configured by: muno
  Configured on: Thu Sep 30 17:59:36 UTC 2021
  Configure host: loki.local
  Configure command line: 'CC=nvc' 'CXX=nvc++' 'FC=nvfortran' 'FCFLAGS=-fPIC'
      '--enable-mca-no-build=op-avx'
      '--prefix=/stage/opt/OpenMPI/ROME/4.1.1/NV-HPC/21.9'
      '--with-libevent=internal' '--enable-mpi1-compatibility'
      '--without-xpmem' '--with-pmi' '--enable-mpi-cxx'
      '--with-hwloc=/stage/opt/HWLOC/2.5.0'
      '--with-hcoll=/opt/mellanox/hcoll'
      '--with-knem=/opt/knem-1.1.4.90mlnx1'
      '--with-cuda=/stage/opt/NV_hpc_sdk/Linux_x86_64/21.7/cuda'
  Built by: muno
  Built on: Thu Sep 30 18:17:02 UTC 2021
  Built host: loki.local
  C bindings: yes
  C++ bindings: yes
  Fort mpif.h: yes (all)
  Fort use mpi: yes (full: ignore TKR)
  Fort use mpi size: deprecated-ompi-info-value
  Fort use mpi_f08: yes
  Fort mpi_f08 compliance: The mpi_f08 module is available, but due to
      limitations in the nvfortran compiler and/or Open MPI, does not support
      the following: array subsections, direct passthru (where possible) to
      underlying Open MPI's C functionality
  Fort mpi_f08 subarrays: no
  Java bindings: no
  Wrapper compiler rpath: runpath
  C compiler: nvc
  C compiler absolute: /stage/opt/NV_hpc_sdk/Linux_x86_64/21.9/compilers/bin/nvc
  C compiler family name: PGI
  C compiler version: 21.9-0
  C++ compiler: nvc++
  C++ compiler absolute: /stage/opt/NV_hpc_sdk/Linux_x86_64/21.9/compilers/bin/nvc++
  Fort compiler: nvfortran

On 9/30/21 12:18 PM, Ray Muno via users wrote:
> OK, starting clean. [...]

-Ray Muno

On 9/30/21 8:13 AM, Gilles Gouaillardet via users wrote:
> Ray,
>
> there is a typo, the configure option is --enable-mca-no-build=op-avx
>
> Cheers,
> Gilles
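For later readers, the working recipe from this thread can be collected into a single build script. This is only a sketch of Ray's final configure line: every path and version here is site-specific (taken from his setup), and the `make -j 8` step is an assumed typical invocation, not part of the original report.

```shell
#!/bin/sh
# Sketch of the build recipe that succeeded in this thread.
# All paths below are from the poster's site; adjust for your own.
# Key points: FCFLAGS=-fPIC for nvfortran, and disabling the op-avx
# MCA component that fails to link with nvc.
SDK=/stage/opt/NV_hpc_sdk/Linux_x86_64/21.9

/project/muno/OpenMPI/BUILD/SRC/openmpi-4.1.1/configure \
    CC=nvc CXX=nvc++ FC=nvfortran FCFLAGS=-fPIC \
    --enable-mca-no-build=op-avx \
    --prefix=/stage/opt/OpenMPI/ROME/4.1.1/NV-HPC/21.9 \
    --with-libevent=internal \
    --enable-mpi1-compatibility \
    --without-xpmem --with-pmi --enable-mpi-cxx \
    --with-hwloc=/stage/opt/HWLOC/2.5.0 \
    --with-hcoll=/opt/mellanox/hcoll \
    --with-knem=/opt/knem-1.1.4.90mlnx1 \
    --with-cuda="$SDK/cuda" \
&& make -j 8 && make install
```

Note that if --with-platform=.../contrib/platform/mellanox/optimized is added, it overrides FCFLAGS from the command line, so -fPIC must be added to FCFLAGS inside the platform file instead.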
Re: [OMPI users] OpenMPI 4.1.1, CentOS 7.9, nVidia HPC-SDK, build hints?
OK, starting clean.

  OS CentOS 7.9 (7.9.2009)
  mlnxofed 5.4-1.0.3.0
  UCX 1.11.0 (from mlnxofed)
  hcoll-4.7.3199 (from mlnxofed)
  knem-1.1.4.90 (from mlnxofed)
  nVidia HPC-SDK 21.9
  OpenMPI 4.1.1
  HWLOC 2.5.0

A straight configure,

  configure CC=nvc CXX=nvc++ FC=nvfortran

dies in FCLD libmpi_usempif08.la:

  /usr/bin/ld: .libs/comm_spawn_multiple_f08.o: relocation R_X86_64_32S against `.rodata' can not be used when making a shared object; recompile with -fPIC

Then

  configure CC=nvc CXX=nvc++ FC=nvfortran FCFLAGS='-fPIC'

fixes that, but dies in CCLD mca_op_avx.la:

  ./.libs/liblocal_ops_avx512.a(liblocal_ops_avx512_la-op_avx_functions.o):(.data+0x0): multiple definition of `ompi_op_avx_functions_avx2'
  ./.libs/liblocal_ops_avx2.a(liblocal_ops_avx2_la-op_avx_functions.o):(.data+0x0): first defined here

Finally,

  configure CC=nvc CXX=nvc++ FC=nvfortran FCFLAGS=-fPIC --enable-mca-no-build=op-avx

succeeds. And, working up to what I really want, I can build (somewhat emulating the HPC-X 2.9.0 build) with:

  /project/muno/OpenMPI/BUILD/SRC/openmpi-4.1.1/configure CC=nvc CXX=nvc++ FC=nvfortran \
      FCFLAGS=-fPIC --enable-mca-no-build=op-avx \
      --prefix=/stage/opt/OpenMPI/ROME/4.1.1/NV-HPC/21.9 \
      --with-libevent=internal --enable-mpi1-compatibility --without-xpmem \
      --with-pmi --enable-mpi-cxx \
      --with-hwloc=/stage/opt/HWLOC/2.5.0 \
      --with-hcoll=/opt/mellanox/hcoll \
      --with-knem=/opt/knem-1.1.4.90mlnx1 \
      --with-cuda=/stage/opt/NV_hpc_sdk/Linux_x86_64/21.9/cuda

Adding --with-platform=/project/muno/OpenMPI/BUILD/SRC/openmpi-4.1.1/contrib/platform/mellanox/optimized brings back

  /usr/bin/ld: final link failed: Bad value
  make[2]: *** [libmpi_usempif08.la] Error 2

since it apparently overrides the FCFLAGS setting from the command line. Editing the platform file to add -fPIC to FCFLAGS took care of that, as did FC='nvfortran -fPIC' (which is kludgey).

-Ray Muno

On 9/30/21 8:13 AM, Gilles Gouaillardet via users wrote:
> Ray,
>
> there is a typo, the configure option is --enable-mca-no-build=op-avx
>
> Cheers,
> Gilles
>
> - Original Message -
>> Added --enable-mca-no-build=op-avx to the configure line. Still dies in the same place.
>>
>>   history | grep config
>>
>>   CCLD mca_op_avx.la
>>   ./.libs/liblocal_ops_avx512.a(liblocal_ops_avx512_la-op_avx_functions.o):(.data+0x0): multiple definition of `ompi_op_avx_functions_avx2'
>>   ./.libs/liblocal_ops_avx2.a(liblocal_ops_avx2_la-op_avx_functions.o):(.data+0x0): first defined here
>>   ./.libs/liblocal_ops_avx512.a(liblocal_ops_avx512_la-op_avx_functions.o): In function `ompi_op_avx_2buff_min_uint16_t_avx2':
>>   /project/muno/OpenMPI/BUILD/SRC/openmpi-4.1.1/ompi/mca/op/avx/op_avx_functions.c:651: multiple definition of `ompi_op_avx_3buff_functions_avx2'
>>   ./.libs/liblocal_ops_avx2.a(liblocal_ops_avx2_la-op_avx_functions.o):/project/muno/OpenMPI/BUILD/SRC/openmpi-4.1.1/ompi/mca/op/avx/op_avx_functions.c:651: first defined here
>>   make[2]: *** [mca_op_avx.la] Error 2
>>   make[2]: Leaving directory `/project/muno/OpenMPI/BUILD/4.1.1/ROME/NV-HPC/21.9/ompi/mca/op/avx'
>>   make[1]: *** [all-recursive] Error 1
>>   make[1]: Leaving directory `/project/muno/OpenMPI/BUILD/4.1.1/ROME/NV-HPC/21.9/ompi'
>>   make: *** [all-recursive] Error 1
>>
>> On 9/30/21 5:54 AM, Carl Ponder wrote:
>>> For now, you can suppress this error building OpenMPI 4.1.1 [...] with the NVHPC/PGI 21.9 compiler by using the setting
>>>
>>>   configure --enable-mca-no-build=op-avx ...
>>>
>>> We're still looking at the cause here. I don't have any advice about the problem with 21.7.
>>>
>>> Subject: Re: [OMPI users] OpenMPI 4.1.1, CentOS 7.9, nVidia HPC-SDK, build hints?
>>> Date: Wed, 29 Sep 2021 12:25:43 -0500
>>> From: Ray Muno via users
>>> Reply-To: Open MPI Users
>>> To: users@lists.open-mpi.org
>>> CC: Ray Muno
>>>
>>> Tried this: configure CC='nvc -fPIC' CXX='nvc++ -fPIC' FC='nvfortran -fPIC' [...]
Re: [OMPI users] OpenMPI 4.1.1, CentOS 7.9, nVidia HPC-SDK, build hints?
Tried this:

  configure CC='nvc -fPIC' CXX='nvc++ -fPIC' FC='nvfortran -fPIC'

Configure completes, and it compiles quite a way through before dying in a different place. It does get past the first error with libmpi_usempif08.la:

  FCLD libmpi_usempif08.la
  make[2]: Leaving directory `/project/muno/OpenMPI/BUILD/4.1.1/ROME/NV-HPC/21.9/ompi/mpi/fortran/use-mpi-f08'
  Making all in mpi/fortran/mpiext-use-mpi-f08
  make[2]: Entering directory `/project/muno/OpenMPI/BUILD/4.1.1/ROME/NV-HPC/21.9/ompi/mpi/fortran/mpiext-use-mpi-f08'
  PPFC mpi-f08-ext-module.lo
  FCLD libforce_usempif08_module_to_be_built.la
  make[2]: Leaving directory `/project/muno/OpenMPI/BUILD/4.1.1/ROME/NV-HPC/21.9/ompi/mpi/fortran/mpiext-use-mpi-f08'

Dies here now:

  CCLD liblocal_ops_avx512.la
  CCLD mca_op_avx.la
  ./.libs/liblocal_ops_avx512.a(liblocal_ops_avx512_la-op_avx_functions.o):(.data+0x0): multiple definition of `ompi_op_avx_functions_avx2'
  ./.libs/liblocal_ops_avx2.a(liblocal_ops_avx2_la-op_avx_functions.o):(.data+0x0): first defined here
  ./.libs/liblocal_ops_avx512.a(liblocal_ops_avx512_la-op_avx_functions.o): In function `ompi_op_avx_2buff_min_uint16_t_avx2':
  /project/muno/OpenMPI/BUILD/SRC/openmpi-4.1.1/ompi/mca/op/avx/op_avx_functions.c:651: multiple definition of `ompi_op_avx_3buff_functions_avx2'
  ./.libs/liblocal_ops_avx2.a(liblocal_ops_avx2_la-op_avx_functions.o):/project/muno/OpenMPI/BUILD/SRC/openmpi-4.1.1/ompi/mca/op/avx/op_avx_functions.c:651: first defined here
  make[2]: *** [mca_op_avx.la] Error 2
  make[2]: Leaving directory `/project/muno/OpenMPI/BUILD/4.1.1/ROME/NV-HPC/21.9/ompi/mca/op/avx'
  make[1]: *** [all-recursive] Error 1
  make[1]: Leaving directory `/project/muno/OpenMPI/BUILD/4.1.1/ROME/NV-HPC/21.9/ompi'
  make: *** [all-recursive] Error 1

On 9/29/21 11:42 AM, Bennet Fauber via users wrote:
> Ray,
>
> If all the errors about not being compiled with -fPIC are still appearing, there may be a bug that is preventing the option from getting through to the compiler(s). It might be worth looking through the logs to see the full compile command for one or more of them to see whether that is true? Say, .libs/comm_spawn_multiple_f08.o, for example.
>
> If -fPIC is missing, you may be able to recompile that manually with -fPIC in place, then remake and see if that also makes the link error go away; that would be a good start.
>
> Hope this helps,
> -- bennet
>
> [...]

-- 
Ray Muno
IT Systems Administrator
e-mail: m...@umn.edu
Phone: (612) 625-9531
University of Minnesota
Aerospace Engineering and Mechanics
110 Union St. S.E.
Minneapolis, MN 55455
Re: [OMPI users] OpenMPI 4.1.1, CentOS 7.9, nVidia HPC-SDK, build hints?
In config.log, it appears to be passing the flag in the tests. It also fails with version 21.9 of the HPC-SDK.

On 9/29/21 11:42 AM, Bennet Fauber via users wrote:
> Ray,
>
> If all the errors about not being compiled with -fPIC are still appearing, there may be a bug that is preventing the option from getting through to the compiler(s). [...]

-- 
Ray Muno
IT Systems Administrator
e-mail: m...@umn.edu
Phone: (612) 625-9531
University of Minnesota
Aerospace Engineering and Mechanics
110 Union St. S.E.
Minneapolis, MN 55455
Re: [OMPI users] OpenMPI 4.1.1, CentOS 7.9, nVidia HPC-SDK, build hints?
I did try that and it fails at the same place. Which version of the nVidia HPC-SDK are you using? I am using 21.7. I see there is an upgrade to 21.9, which came out since I installed. I have it installed and will try it to see if they changed anything. Not much in the release notes to indicate any major changes.

-Ray Muno

On 9/29/21 10:54 AM, Jing Gong wrote:
> Hi,
>
> Before the Nvidia people look into the details, you can probably try adding the flag "-fPIC" to the nvhpc compilers, like CC="nvc -fPIC", which at least worked for me.
>
> /Jing
>
> [...]

-- 
Ray Muno
IT Systems Administrator
e-mail: m...@umn.edu
University of Minnesota
Aerospace Engineering and Mechanics
Re: [OMPI users] OpenMPI 4.1.1, CentOS 7.9, nVidia HPC-SDK, build hints?
On 9/29/21 9:58 AM, Jeff Squyres (jsquyres) wrote:
> Ray --
>
> Looks like this is a dup of https://github.com/open-mpi/ompi/issues/8919.

Looking at the page on the OpenMPI Developer list, it says "A known workaround is to add '-fPIC' to the CFLAGS, CXXFLAGS, FCFLAGS (maybe not need to all of those)." Tried adding these; it still fails at the same place.

-- 
Ray Muno
IT Systems Administrator
e-mail: m...@umn.edu
University of Minnesota
Aerospace Engineering and Mechanics
Re: [OMPI users] OpenMPI 4.1.1, CentOS 7.9, nVidia HPC-SDK, build hints?
Thanks, I looked through previous emails here in the user list. I guess I need to subscribe to the Developers list.

-Ray Muno

On 9/29/21 9:58 AM, Jeff Squyres (jsquyres) wrote:
> Ray --
>
> Looks like this is a dup of https://github.com/open-mpi/ompi/issues/8919.
>
> On Sep 29, 2021, at 10:47 AM, Ray Muno via users <users@lists.open-mpi.org> wrote:
>
> Looking to compile OpenMPI 4.1.1 under CentOS 7.9 (with the Mellanox OFED 5.3 stack) with the nVidia HPC-SDK, version 21.7. Configure works, the build fails.
>
> The nVidia HPC-SDK environment module sets these variables, etc.:
>
>   setenv       NVHPC            /stage/opt/NV_hpc_sdk
>   setenv       CC               /stage/opt/NV_hpc_sdk/Linux_x86_64/21.7/compilers/bin/nvc
>   setenv       CXX              /stage/opt/NV_hpc_sdk/Linux_x86_64/21.7/compilers/bin/nvc++
>   setenv       FC               /stage/opt/NV_hpc_sdk/Linux_x86_64/21.7/compilers/bin/nvfortran
>   setenv       F90              /stage/opt/NV_hpc_sdk/Linux_x86_64/21.7/compilers/bin/nvfortran
>   setenv       F77              /stage/opt/NV_hpc_sdk/Linux_x86_64/21.7/compilers/bin/nvfortran
>   setenv       CPP              cpp
>   prepend-path PATH             /stage/opt/NV_hpc_sdk/Linux_x86_64/21.7/cuda/bin
>   prepend-path PATH             /stage/opt/NV_hpc_sdk/Linux_x86_64/21.7/compilers/bin
>   prepend-path PATH             /stage/opt/NV_hpc_sdk/Linux_x86_64/21.7/compilers/extras/qd/bin
>   prepend-path LD_LIBRARY_PATH  /stage/opt/NV_hpc_sdk/Linux_x86_64/21.7/cuda/lib64
>   prepend-path LD_LIBRARY_PATH  /stage/opt/NV_hpc_sdk/Linux_x86_64/21.7/cuda/extras/CUPTI/lib64
>   prepend-path LD_LIBRARY_PATH  /stage/opt/NV_hpc_sdk/Linux_x86_64/21.7/compilers/extras/qd/lib
>   prepend-path LD_LIBRARY_PATH  /stage/opt/NV_hpc_sdk/Linux_x86_64/21.7/compilers/lib
>   prepend-path LD_LIBRARY_PATH  /stage/opt/NV_hpc_sdk/Linux_x86_64/21.7/math_libs/lib64
>   prepend-path LD_LIBRARY_PATH  /stage/opt/NV_hpc_sdk/Linux_x86_64/21.7/comm_libs/nccl/lib
>   prepend-path LD_LIBRARY_PATH  /stage/opt/NV_hpc_sdk/Linux_x86_64/21.7/comm_libs/nvshmem/lib
>   prepend-path CPATH            /stage/opt/NV_hpc_sdk/Linux_x86_64/21.7/math_libs/include
>   prepend-path CPATH            /stage/opt/NV_hpc_sdk/Linux_x86_64/21.7/comm_libs/nccl/include
>   prepend-path CPATH            /stage/opt/NV_hpc_sdk/Linux_x86_64/21.7/comm_libs/nvshmem/include
>   prepend-path CPATH            /stage/opt/NV_hpc_sdk/Linux_x86_64/21.7/compilers/extras/qd/include/qd
>   prepend-path MANPATH          /stage/opt/NV_hpc_sdk/Linux_x86_64/21.7/compilers/man
>
> ---
>
>   configure --prefix=/stage/opt/OpenMPI/ROME/4.1.1/NV-HPC/21.7 CC=nvc CXX=nvc++ FC=nvfortran
>
> ...
>
>   FCLD libmpi_usempif08.la
>   /usr/bin/ld: .libs/comm_spawn_multiple_f08.o: relocation R_X86_64_32S against `.rodata' can not be used when making a shared object; recompile with -fPIC
>   /usr/bin/ld: .libs/startall_f08.o: relocation R_X86_64_32S against `.rodata' can not be used when making a shared object; recompile with -fPIC
>   /usr/bin/ld: .libs/testall_f08.o: relocation R_X86_64_32S against `.rodata' can not be used when making a shared object; recompile with -fPIC
>   /usr/bin/ld: .libs/testany_f08.o: relocation R_X86_64_32S against `.rodata' can not be used when making a shared object; recompile with -fPIC
>   /usr/bin/ld: .libs/testsome_f08.o: relocation R_X86_64_32S against `.rodata' can not be used when making a shared object; recompile with -fPIC
>   /usr/bin/ld: .libs/type_create_struct_f08.o: relocation R_X86_64_32S against `.rodata' can not be used when making a shared object; recompile with -fPIC
>   /usr/bin/ld: .libs/type_get_contents_f08.o: relocation R_X86_64_32S against `.rodata' can not be used when making a shared object; recompile with -fPIC
>   /usr/bin/ld: .libs/waitall_f08.o: relocation R_X86_64_32S against `.rodata' can not be used when making a shared object; recompile with -fPIC
>   /usr/bin/ld: .libs/waitany_f08.o: relocation R_X86_64_32S against `.rodata' can not be used when making a shared object; recompile with -fPIC
>   /usr/bin/ld: .libs/waitsome_f08.o: relocation R_X86_64_32S against `.rodata' can not be used when making a shared object; recompile with -fPIC
>   /usr/bin/ld: profile/.libs/pcomm_spawn_multiple_f08.o: relocation R_X86_64_32S against `.rodata' can not be used when making a shared object; recompile with -fPIC
>   /usr/bin/ld: profile/.libs/pstartall_f08.o: relocation R_X86_64_32S against `.rodata' can not be used when making a shared object; recompile with -fPIC
>   /usr/bin/ld: profile/.libs/ptestall_f08.o: relocation R_X86_64_32S against `.rodata' can not be used when making a shared object; recompile with -fPIC
>   /usr/bin/ld: profile/.libs/ptestany_f08.o: relocation R_X86_64_32S against `.rodata' can not be used when making a shared object; recompile with -fPIC
>   /usr/bin/ld: profile/.libs/ptestsome_f08.o: relocation R_X86_64_32S against `.rodata' can not be used when making a shared object; recompile with -fPIC
>   /usr/bin/ld: profile/.libs/ptype_create_struct_f08.o: relocation R_X86_64_32S against `.rodata' can not be used when making a shared object; recompile with -fPIC
>   /usr/bin/ld: profile/.libs/ptype_get_contents_f08.o: reloc
[OMPI users] OpenMPI 4.1.1, CentOS 7.9, nVidia HPC-SDK, build hints?
/pwaitany_f08.o: relocation R_X86_64_32S against `.rodata' can not be used when making a shared object; recompile with -fPIC
/usr/bin/ld: profile/.libs/pwaitsome_f08.o: relocation R_X86_64_32S against `.rodata' can not be used when making a shared object; recompile with -fPIC
/usr/bin/ld: .libs/abort_f08.o: relocation R_X86_64_PC32 against symbol `ompi_abort_f' can not be used when making a shared object; recompile with -fPIC
/usr/bin/ld: final link failed: Bad value
make[2]: *** [libmpi_usempif08.la] Error 2
make[2]: Leaving directory `/project/muno/OpenMPI/BUILD/4.1.1/ROME/NV-HPC/21.7/ompi/mpi/fortran/use-mpi-f08'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/project/muno/OpenMPI/BUILD/4.1.1/ROME/NV-HPC/21.7/ompi'
make: *** [all-recursive] Error 1

-- 
Ray Muno
IT Systems Administrator
e-mail: m...@umn.edu
University of Minnesota
Aerospace Engineering and Mechanics
Re: [OMPI users] openmpi/pmix/ucx
We're using MLNX_OFED 4.7.3. It supplies UCX 1.7.0. We have OpenMPI 4.0.2 compiled against the Mellanox OFED 4.7.3-provided versions of UCX, KNEM and HCOLL, along with HWLOC 2.1.0 from the OpenMPI site. I mirrored the build to match what Mellanox used to configure OpenMPI in HPC-X 2.5. I have users using the GCC, PGI, Intel and AOCC compilers with this config. PGI was the only one that was a challenge to build, due to conflicts with HCOLL.

-Ray Muno

On 2/7/20 10:04 AM, Michael Di Domenico via users wrote:
> i haven't compiled openmpi in a while, but i'm in the process of upgrading our cluster. the last time i did this there were specific versions of mpi/pmix/ucx that were all tested and supposed to work together. my understanding of this was because pmix/ucx was under rapid development and the APIs were changing.
>
> is that still an issue, or can i take the latest stable branches from git for each and have a relatively good shot at it all working together? the one semi-immovable i have right now is ucx, which is at 1.7.0 as installed by mellanox ofed.
>
> if the above is true, is there a matrix of versions i should be using for all the others? nothing jumped out at me on the openmpi website

-- 
Ray Muno
IT Manager
e-mail: m...@aem.umn.edu
University of Minnesota
Aerospace Engineering and Mechanics, Mechanical Engineering
Re: [OMPI users] OpenMPI 4.0.2 with PGI 19.10, will not build with hcoll
I opened a case with pgroup support regarding this. We are also using Slurm along with HCOLL.

-Ray Muno

On 1/26/20 5:52 AM, Åke Sandgren via users wrote:
> Note that when built against SLURM it will pick up pthread from libslurm.la too.
>
> On 1/26/20 4:37 AM, Gilles Gouaillardet via users wrote:
>> Thanks Jeff for the information and sharing the pointer.
>>
>> FWIW, this issue typically occurs when libtool pulls the -pthread flag from libhcoll.la that was compiled with a GNU compiler. The simplest workaround is to remove libhcoll.la (so libtool simply links with libhcoll.so and does not pull any compiler flags), and the right fix is imho to either have the libtool maintainers handle this case or the PGI/NVIDIA folks add support for the -pthread flag.
>>
>> Cheers, Gilles
>>
>> On Sun, Jan 26, 2020 at 12:09 PM Jeff Hammond via users wrote:
>>> To be more strictly equivalent, you will want to add -D_REENTRANT to the substitution, but this may not affect hcoll. https://stackoverflow.com/questions/2127797/significance-of-pthread-flag-when-compiling/2127819#2127819
>>>
>>> The proper fix here is a change in the OMPI build system, of course, to not set -pthread when PGI is used.
>>>
>>> Jeff
>>>
>>> On Fri, Jan 24, 2020 at 11:31 AM Åke Sandgren via users wrote:
>>>> PGI needs this in, for instance, its siterc or localrc:
>>>>
>>>>   # replace unknown switch -pthread with -lpthread
>>>>   switch -pthread is replace(-lpthread) positional(linker);
>>>>
>>>> On 1/24/20 8:12 PM, Raymond Muno via users wrote:
>>>>> I am having issues building OpenMPI 4.0.2 using the PGI 19.10 compilers. OS is CentOS 7.7, MLNX_OFED 4.7.3.
>>>>>
>>>>> It dies at:
>>>>>
>>>>>   PGC/x86-64 Linux 19.10-0: compilation completed with warnings
>>>>>   CCLD mca_coll_hcoll.la
>>>>>   pgcc-Error-Unknown switch: -pthread
>>>>>   make[2]: *** [mca_coll_hcoll.la] Error 1
>>>>>   make[2]: Leaving directory `/project/muno/OpenMPI/PGI/openmpi-4.0.2/ompi/mca/coll/hcoll'
>>>>>   make[1]: *** [all-recursive] Error 1
>>>>>   make[1]: Leaving directory `/project/muno/OpenMPI/PGI/openmpi-4.0.2/ompi'
>>>>>   make: *** [all-recursive] Error 1
>>>>>
>>>>> I tried with PGI 19.9 and had the same issue. If I do not include hcoll, it builds. I have successfully built OpenMPI 4.0.2 with the GCC, Intel and AOCC compilers, all using the same options. hcoll is provided by MLNX_OFED 4.7.3 and configure is run with --with-hcoll=/opt/mellanox/hcoll

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se  Phone: +46 90 7866134  Fax: +46 90-580 14
Mobile: +46 70 7716134  WWW: http://www.hpc2n.umu.se

-- 
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/

-- 
Ray Muno
IT Manager
University of Minnesota
Aerospace Engineering and Mechanics, Mechanical Engineering
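For reference, Åke's siterc/localrc change can be applied like this. The compiler path below is only an example; use the bin directory of your own PGI/NVHPC installation, and note this is a site-configuration sketch, not an official PGI procedure:

```shell
# Append the -pthread rewrite rule to the PGI compiler's siterc so pgcc
# stops rejecting the flag that libtool pulls in from libhcoll.la.
# PGI_BIN is an example path; point it at your installation's bin directory.
PGI_BIN=/opt/pgi/linux86-64/19.10/bin

cat >> "$PGI_BIN/siterc" <<'EOF'
# replace unknown switch -pthread with -lpthread
switch -pthread is replace(-lpthread) positional(linker);
EOF
```

Per Jeff Hammond's note, adding -D_REENTRANT to the substitution makes it more strictly equivalent to what -pthread does with GCC.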
Re: [OMPI users] OpenMPI 1.4.2 with Myrinet MX, mpirun seg faults
On 10/28/2010 01:40 PM, Scott Atchley wrote:
> Does your environment have LD_LIBRARY_PATH set to point to $OMPI/lib and $MX/lib? Does it get set on login? Is $OMPI/bin in your PATH?
>
> Scott

$MX/lib was not in LD_LIBRARY_PATH. That is interesting. On the head node:

  [/etc/ld.so.conf.d]$ more mx.conf
  /opt/mx/lib

But that is not there on the compute nodes. It must have been there before the rebuild. I was looking in /etc/ld.so.conf* for things that were getting in my way, but not for things that were missing. In any event, adding $MX/lib to the relevant module takes care of the issue.

Thank you...

-- 
Ray Muno
University of Minnesota
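The fix described above amounts to restoring the loader configuration on the compute nodes, roughly as follows. This is a sketch based on the paths in this thread (requires root; the MX library name is an assumption and may differ by MX version):

```shell
# On each compute node: make the MX runtime libraries visible to the
# dynamic loader, matching the mx.conf present on the head node.
echo /opt/mx/lib > /etc/ld.so.conf.d/mx.conf
ldconfig

# Verify the loader cache now includes the MX library
# (libmyriexpress is the usual MX library name, but check your install).
ldconfig -p | grep myriexpress
```

Setting LD_LIBRARY_PATH in the environment module, as Ray did, is the equivalent per-user fix when you cannot touch /etc/ld.so.conf.d.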
Re: [OMPI users] OpenMPI 1.4.2 with Myrinet MX, mpirun seg faults
On 10/22/2010 07:36 AM, Scott Atchley wrote:
> Ray,
>
> Looking back at your original message, you say that it works if you use the Myricom-supplied mpirun from the Myrinet roll. I wonder if this is a mismatch between libraries on the compute nodes.
>
> What do you get if you use your OMPI's mpirun with:
>
>   $ mpirun -n 1 -H ldd $PWD/
>
> I am wondering if ldd finds the libraries from your compile or the Myrinet roll.

OK, a bit of a hiatus trying to get this resolved. Had to tend other fires...

I do think I had an issue of mixed environments. It is a Rocks 5.3 test cluster and it had an old version of OpenMPI installed as part of the Rocks 5.3 HPC roll. I have now removed the HPC roll. All nodes were rebuilt. In the previous setup, we could actually run OpenMPI jobs over MX.

With all other spurious versions of OpenMPI (and MPICH, for that matter) removed, I have rebuilt and re-installed OpenMPI 1.4.3 from a fresh source tree. It was built with the PGI 10.8 compilers. Now, we cannot run with MX at all. The install was built with MX:

  $ ompi_info | grep mx
    MCA btl: mx (MCA v2.0, API v2.0, Component v1.4.3)
    MCA mtl: mx (MCA v2.0, API v2.0, Component v1.4.3)

I can run with TCP, but now I get:

  [compute-0-1.local:24863] mca: base: component_find: unable to open /share/apps/opt/OpenMPI/1.4.3/PGI/10.8/lib/openmpi/mca_mtl_mx: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)

  $ ls -l /share/apps/opt/OpenMPI/1.4.3/PGI/10.8/lib/openmpi/mca_mtl_mx*
  -rwxr-xr-x 1 muno muno  1070 Oct 28 12:49 /share/apps/opt/OpenMPI/1.4.3/PGI/10.8/lib/openmpi/mca_mtl_mx.la
  -rwxr-xr-x 1 muno muno 32044 Oct 28 12:49 /share/apps/opt/OpenMPI/1.4.3/PGI/10.8/lib/openmpi/mca_mtl_mx.so

  mpirun -v -nolocal -np 96 --x MX_RCACHE=2 -hostfile machines --mca mtl mx --mca pml cm cpi.pgi
  [compute-0-3.local:21116] mca: base: component_find: unable to open /share/apps/opt/OpenMPI/1.4.3/PGI/10.8/lib/openmpi/mca_mtl_mx: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
  [compute-0-3.local:21115] mca: base: component_find: unable to open /share/apps/opt/OpenMPI/1.4.3/PGI/10.8/lib/openmpi/mca_mtl_mx: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
  --------------------------------------------------------------------------
  A requested component was not found, or was unable to be opened. This means
  that this component is either not installed or is unable to be used on your
  system (e.g., sometimes this means that shared libraries that the component
  requires are unable to be found/loaded). Note that Open MPI stopped checking
  at the first component that it did not find.

  Host:      compute-0-3.local
  Framework: mtl
  Component: mx
  --------------------------------------------------------------------------
  [compute-0-3.local:21116] mca: base: components_open: component pml / cm open function failed
  [compute-0-3.local:21115] mca: base: components_open: component pml / cm open function failed
  [compute-0-3.local:21117] mca: base: component_find: unable to open /share/apps/opt/OpenMPI/1.4.3/PGI/10.8/lib/openmpi/mca_mtl_mx: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)

-- 
Ray Muno
University of Minnesota
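When an MCA component fails to open like this ("perhaps a missing symbol, or compiled for a different version of Open MPI"), a quick first check is whether the plugin's shared object resolves all of its library dependencies on the failing node. A sketch, using the path from this thread (the exact missing library, if any, will show up as "not found"):

```shell
# On a compute node: does the MX MTL plugin resolve all its libraries?
# Any line containing "not found" points at the missing dependency
# (e.g. libmyriexpress.so if $MX/lib is absent from the loader path).
ldd /share/apps/opt/OpenMPI/1.4.3/PGI/10.8/lib/openmpi/mca_mtl_mx.so

# Narrow it down to just the unresolved entries:
ldd /share/apps/opt/OpenMPI/1.4.3/PGI/10.8/lib/openmpi/mca_mtl_mx.so | grep 'not found'
```

Running the same ldd on the head node and diffing the output against a compute node is an easy way to spot the mixed-environment problem described above.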
Re: [OMPI users] OpenMPI and SGE
As a follow-up: the problem was with host name resolution. The error was introduced by a change to the Rocks environment, which broke reverse lookups for host names.

--
Ray Muno
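The broken reverse lookups described above can be checked directly from a head node. A minimal sketch (the host names are taken from the logs earlier in this thread and are assumptions; substitute your own node list):

```shell
# Check that forward and reverse DNS agree for each compute node.
# SGE's commlib refuses the connection when the client IP resolves
# back to an empty or different host name, as in the errors above.
check_host() {
    h=$1
    ip=$(getent hosts "$h" | awk '{print $1; exit}')
    rev=$(getent hosts "$ip" | awk '{print $2; exit}')
    if [ "$rev" = "$h" ]; then
        echo "OK: $h -> $ip -> $rev"
    else
        echo "MISMATCH: $h -> ${ip:-<unresolved>} -> ${rev:-<none>}"
    fi
}

for h in compute-1-1.local compute-6-25.local; do
    check_host "$h"
done
```

Any MISMATCH line points at the host entry (or reverse zone) that the Rocks change clobbered.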
Re: [OMPI users] OpenMPI and SGE
Rolf Vandevaart wrote:
>> PMGR_COLLECTIVE ERROR: unitialized MPI task: Missing required
>> environment variable: MPIRUN_RANK
>> PMGR_COLLECTIVE ERROR: PMGR_COLLECTIVE ERROR: unitialized MPI task:
>> Missing required environment variable: MPIRUN_RANK
>
> I do not recognize these errors as part of Open MPI.  A google search
> showed they might be coming from MVAPICH.  Is there a chance we are
> using Open MPI to launch the jobs (via Open MPI mpirun) but we are
> actually launching an application that is linked to MVAPICH?

You are correct. I was trying to run the MVAPICH compiled test program.

With an OpenMPI compiled test, I do get an extra line of output with the verbose flag. The program just hangs at that point.

[muno@compute-6-30 ~]$ which mpirun
/share/apps/opt/openmpi_pgi/bin/mpirun

[muno@compute-6-30 ~]$ ldd a.out
        libmpi_f90.so.0 => /share/apps/opt/openmpi_pgi/lib/libmpi_f90.so.0 (0x2aaad000)
        libmpi_f77.so.0 => /share/apps/opt/openmpi_pgi/lib/libmpi_f77.so.0 (0x2acb)
        libmpi.so.0 => /share/apps/opt/openmpi_pgi/lib/libmpi.so.0 (0x2aee)
        ...

mpirun -np $NSLOTS -mca pls_gridengine_verbose 1 a.out

Starting server daemon at host "compute-6-25.local"
Starting server daemon at host "compute-1-1.local"
Server daemon successfully started with task id "1.compute-6-25"
error: commlib error: access denied (client IP resolved to host name "". This is not identical to clients host name "")
error: executing task of job 12144 failed: failed sending task to execd@compute-1-1.local: can't find connection
[compute-6-25.local:10810] ERROR: A daemon on node compute-1-1.local failed to start as expected.
[compute-6-25.local:10810] ERROR: There may be more information available from
[compute-6-25.local:10810] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[compute-6-25.local:10810] ERROR: If the problem persists, please restart the
[compute-6-25.local:10810] ERROR: Grid Engine PE job
[compute-6-25.local:10810] ERROR: The daemon exited unexpectedly with status 1.
Establishing /usr/bin/ssh session to host compute-6-25.local ...

--
Ray Muno
Re: [OMPI users] OpenMPI and SGE
Ray Muno wrote:
> Tha give me

How about "That gives me"?

> PMGR_COLLECTIVE ERROR: unitialized MPI task: Missing required
> environment variable: MPIRUN_RANK
> PMGR_COLLECTIVE ERROR: PMGR_COLLECTIVE ERROR: unitialized MPI task:
> Missing required environment variable: MPIRUN_RANK

--
Ray Muno
Re: [OMPI users] OpenMPI and SGE
Rolf Vandevaart wrote:
> Ray Muno wrote:
>> Ray Muno wrote:
>>> We are running a cluster using Rocks 5.0 and OpenMPI 1.2 (primarily).
>>> Scheduling is done through SGE. MPI communication is over InfiniBand.
>>
>> We also have OpenMPI 1.3 installed and receive similar errors.
>
> This does sound like a problem with SGE.  By default, we use qrsh to
> start the jobs on all the remote nodes.  I believe that is the command
> that is failing.  There are two things you can try to get more info
> depending on the version of Open MPI.  With version 1.2, you can try
> this to get more information.
>
>   --mca pls_gridengine_verbose 1

This did not look like it gave me any more info.

> With Open MPI 1.3.2 and later the verbose flag will not help.  But
> instead, you can disable the use of qrsh and instead use rsh/ssh to
> start the remote jobs.
>
>   --mca plm_rsh_disable_qrsh 1

Tha give me

PMGR_COLLECTIVE ERROR: unitialized MPI task: Missing required environment variable: MPIRUN_RANK
PMGR_COLLECTIVE ERROR: PMGR_COLLECTIVE ERROR: unitialized MPI task: Missing required environment variable: MPIRUN_RANK

--
Ray Muno
University of Minnesota
[OMPI users] OpenMPI and SGE
We are running a cluster using Rocks 5.0 and OpenMPI 1.2 (primarily). Scheduling is done through SGE. MPI communication is over InfiniBand. We have been running with this setup for over 9 months.

Last week, all user jobs stopped executing (cluster load dropped to zero). Users can schedule jobs, but when they try to execute, they get errors of the form:

[compute-2-5.local:12321] ERROR: The daemon exited unexpectedly with status 1.
error: commlib error: access denied (client IP resolved to host name "". This is not identical to clients host name "")
error: executing task of job 11901 failed: failed sending task to execd@compute-5-9.local: can't find connection
[compute-2-5.local:12321] ERROR: A daemon on node compute-5-9.local failed to start as expected.
[compute-2-5.local:12321] ERROR: There may be more information available from
[compute-2-5.local:12321] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[compute-2-5.local:12321] ERROR: If the problem persists, please restart the
[compute-2-5.local:12321] ERROR: Grid Engine PE job
[compute-2-5.local:12321] ERROR: The daemon exited unexpectedly with status 1.
error: commlib error: access denied (client IP resolved to host name "". This is not identical to clients host name "")

When run interactively, we see:

error: commlib error: access denied (client IP resolved to host name "". This is not identical to clients host name "")
error: executing task of job 12094 failed: failed sending task to execd@compute-4-11.local: can't find connection
--------------------------------------------------------------------------
A daemon (pid 4938) died unexpectedly with status 1 while attempting to
launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have
the location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
mpirun: clean termination accomplished

This seems to be an error with SGE, but it is only affecting OpenMPI. Users can successfully launch and run jobs with MVAPICH. Some changes were made to the Rocks setup that may have caused this, but I have not found where the actual problem lies.

--
Ray Muno
University of Minnesota
[OMPI users] OpenMPI 1.3RC2 job startup issue
We have been happily running under OpenMPI 1.2 on our cluster until recently. It is 2200 processors (8-way Opteron), Qlogic IB connected. We have had issues starting larger jobs (600+ processors). There seemed to be some indication that OpenMPI 1.3 may solve our problems.

It built with no problem and installed. Users can compile programs. When they tried to run, they got the attached output. Are we missing something obvious? This is a Rocks cluster with jobs scheduled through SGE.

$ mpirun -np 1024 program
[compute-2-6.local:32580] Error: unknown option "--daemonize"
Usage: orted [OPTION]...
Start an Open RTE Daemon
   --bootproxy            Run as boot proxy for
   -d|--debug             Debug the OpenRTE
   -d|--spin              Have the orted spin until we can connect a debugger to it
   --debug-daemons        Enable debugging of OpenRTE daemons
   --debug-daemons-file   Enable debugging of OpenRTE daemons, storing output in files
   --gprreplica           Registry contact information.
   -h|--help              This help message
   --mpi-call-yield       Have MPI (or similar) applications call yield when idle
   --name                 Set the orte process name
   --no-daemonize         Don't daemonize into the background
   --nodename             Node name as specified by host/resource description.
   --ns-nds               set sds/nds component to use for daemon (normally not needed)
   --nsreplica            Name service contact information.
   --num_procs            Set the number of process in this job
   --persistent           Remain alive after the application process completes
   --report-uri           Report this process' uri on indicated pipe
   --scope                Set restrictions on who can connect to this universe
   --seed                 Host replicas for the core universe services
   --set-sid              Direct the orted to separate from the current session
   --tmpdir               Set the root for the session directory tree
   --universe             Set the universe name as username@hostname:universe_name for this application
   --vpid_start           Set the starting vpid for this job
--------------------------------------------------------------------------
A daemon (pid 4151) died unexpectedly with status 251 while attempting to
launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have
the location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons on the nodes shown
below. Additional manual cleanup may be required - please refer to the
"orte-clean" tool for assistance.
--------------------------------------------------------------------------
compute-5-15.local - daemon did not report back when launched
compute-5-35.local - daemon did not report back when launched
compute-4-8.local - daemon did not report back when launched
compute-7-2.local - daemon did not report back when launched
compute-2-6.local - daemon did not report back when launched
compute-6-28.local - daemon did not report back when launched
compute-6-35.local - daemon did not report back when launched
compute-6-25.local
compute-6-26.local
compute-2-19.local - daemon did not report back when launched
compute-6-37.local - daemon did not report back when launched
compute-6-12.local - daemon did not report back when launched
compute-2-36.local - daemon did not report back when launched
compute-7-5.local - daemon did not report back when launched
compute-7-23.local - daemon did not report back when launched
========

--
Ray Muno
University of Minnesota
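An "unknown option" error from orted, as above, commonly means the remote nodes launched an orted from a different Open MPI installation than the mpirun on the head node (the non-interactive PATH on compute nodes often differs from the login PATH). A quick sketch for checking this — the host name is taken from the logs above and the ssh step is an assumption about the cluster setup:

```shell
# Compare the orted resolved locally with the one a non-interactive
# remote shell would find.  Differing paths usually explain a
# version-mismatch error like the one in this post.
local_orted=$(command -v orted || true)
remote_orted=$(ssh -o BatchMode=yes compute-2-6.local 'command -v orted' 2>/dev/null || true)

echo "local:  ${local_orted:-<not found>}"
echo "remote: ${remote_orted:-<not found>}"

if [ "$local_orted" != "$remote_orted" ]; then
    echo "WARNING: orted paths differ between head node and compute node"
fi
```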
Re: [OMPI users] /dev/shm
John Hearns wrote:
> 2008/11/19 Ray Muno <m...@aem.umn.edu>
>> Thought I would revisit this one. We are still having issues with this.
>> It is not clear to me what is leaving the user files behind in /dev/shm.
>> This is not something users are doing directly, they are just compiling
>> their code directly with mpif90 (from OpenMPI), using various compilers.
>> Compilers in use are PGI, Intel, SunStudio and Pathscale. It looks like
>> every job run leaves something behind in /dev/shm and it slowly fills up.
>> We are just clearing these out at this point.
>
> Could you not run ipcs with the -p flag every few minutes and try to
> figure out what the processes are which are leaving these? (by that I
> mean catch them when they are live, and the creating process is still
> on the system)

OK, what should I be seeing when I run "ipcs -p"? I have a set of nodes that has a user job running on it. It wrote a file in /dev/shm when it started. The job is still running. I see:

# ipcs -p

------ Shared Memory Creator/Last-op ------
shmid      owner      cpid       lpid

------ Message Queues PIDs ------
msqid      owner      lspid      lrpid

--
Ray Muno
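One likely reason the tables above are empty: ipcs lists only System V IPC objects, while Open MPI's shared-memory transport mmap()s regular files, which never appear there. A sketch for finding which live processes hold a given /dev/shm file open, reading /proc directly so it needs no extra tools (the segment file name is hypothetical):

```shell
#!/bin/sh
# List PIDs whose memory maps reference a given backing file.
f=${1:-/dev/shm/example_segment}    # hypothetical segment name; pass the real one

for p in /proc/[0-9]*; do
    if grep -q "$f" "$p/maps" 2>/dev/null; then
        echo "held by PID ${p#/proc/}"
    fi
done
```

Run while the job is alive to catch the creating process, per John's suggestion.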
Re: [OMPI users] /dev/shm
Ralph Castain wrote:
> Hi Ray
>
> Are the jobs that leave files behind terminating normally or aborting?
> Are there any warnings/error messages out of mpirun? Just trying to
> determine if this is an abnormal termination issue or a bug in OMPI
> itself.
>
> Ralph

As far as I know, they are from jobs that are terminating normally. I have had no notice from users of errors. We are still trying to get a handle on this. With 30 users and 280+ nodes, it is something we have not tracked down completely. We are just seeing the after effects of the stale files getting left behind. At some point, new jobs do not launch.

--
Ray Muno
Re: [OMPI users] /dev/shm
Thought I would revisit this one. We are still having issues with this. It is not clear to me what is leaving the user files behind in /dev/shm. This is not something users are doing directly, they are just compiling their code directly with mpif90 (from OpenMPI), using various compilers. Compilers in use are PGI, Intel, SunStudio and Pathscale. It looks like every job run leaves something behind in /dev/shm and it slowly fills up. We are just clearing these out at this point.

Jeff Squyres wrote:
> That is odd. Is your user's app crashing or being forcibly killed? The
> ORTE daemon that is silently launched in v1.2 jobs should ensure that
> files under /tmp/openmpi-sessions-@ are removed.
>
> On Nov 10, 2008, at 2:14 PM, Ray Muno wrote:
>> Brock Palen wrote:
>>> on most systems /dev/shm is limited to half the physical ram. Was the
>>> user someone filling up /dev/shm so there was no space?
>>
>> The problem is there is a large collection of stale files left in there
>> by the users that have run on that node (Rocks based cluster). I am
>> trying to determine why they are left behind.

--
Ray Muno
University of Minnesota
Aerospace Engineering and Mechanics
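The "clearing these out" step can be automated conservatively. A sketch of an age-based sweep to run from cron on each node — the one-day threshold and the SHM_DIR/AGE_MIN variables are assumptions, not anything from this thread, and the fuser guard matters because the mtime of an mmap()ed file can lag behind actual use:

```shell
#!/bin/sh
# Sweep /dev/shm for files that are old by mtime AND not held open by
# any running process.  SHM_DIR is parameterized so the sweep can be
# rehearsed against a scratch directory first.
SHM_DIR=${SHM_DIR:-/dev/shm}
AGE_MIN=${AGE_MIN:-1440}    # minutes; 1440 = 24 hours

find "$SHM_DIR" -maxdepth 1 -type f -mmin +"$AGE_MIN" | while read -r f; do
    # fuser exits non-zero when nothing has the file open
    if ! fuser -s "$f" 2>/dev/null; then
        echo "removing stale segment: $f"
        rm -f -- "$f" 2>/dev/null || true
    fi
done
```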
Re: [OMPI users] Trouble with OpenMPI and Intel 10.1 compilers
Jeff Squyres wrote:
> On Nov 11, 2008, at 2:54 PM, Ray Muno wrote:
>>> See http://www.open-mpi.org/faq/?category=mpi-apps#override-wrappers-after-v1.0.
>>
>> OK, that tells me lots of things ;-) Should I be running configure with
>> --with-wrapper-cflags, --with-wrapper-fflags etc, set to include -i_dynamic
>
> If you want to, sure.  As a guideline, the wrapper compilers put in the
> minimum number of flags to get the job done -- any other flags are a
> local policy decision (and we wouldn't presume to make those for you).
> FWIW, the warning is harmless, but I can see how you wouldn't want
> users to see it / be alarmed by it.

If we have them compile with -i_dynamic, are there any implications?

--
Ray Muno
Re: [OMPI users] Trouble with OpenMPI and Intel 10.1 compilers
Jeff Squyres wrote:
> See http://www.open-mpi.org/faq/?category=mpi-apps#override-wrappers-after-v1.0.

OK, that tells me lots of things ;-)

Should I be running configure with --with-wrapper-cflags, --with-wrapper-fflags etc, set to include -i_dynamic?

--
Ray Muno
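If a site does decide to bake the flag in, the wrapper options above would be added to the configure line from the original post. A sketch — the install prefix is an assumption, and whether to do this at all is the local policy choice Jeff describes:

```shell
# Bake -i_dynamic into the Open MPI wrapper compilers at configure time,
# so users of mpicc/mpif90 do not have to pass it themselves.
./configure CC=icc CXX=icpc F77=ifort FC=ifort \
    --with-wrapper-cflags=-i_dynamic \
    --with-wrapper-cxxflags=-i_dynamic \
    --with-wrapper-fflags=-i_dynamic \
    --with-wrapper-fcflags=-i_dynamic \
    --prefix=/share/apps/opt/openmpi_intel
```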
Re: [OMPI users] Trouble with OpenMPI and Intel 10.1 compilers
Steve Jones wrote:
> Are you adding -i_dynamic to base flags, or something different?
>
> Steve

I brought this up to see if something should be changed with the install. For now, I am leaving that to users.

--
Ray Muno
Re: [OMPI users] Trouble with OpenMPI and Intel 10.1 compilers
Gus Correa wrote:
> Hi Ray and list
>
> I have Intel ifort 10.1.017 on a Rocks 4.3 cluster. The OpenMPI compiler
> wrappers (i.e. "opal_wrapper") work fine, and find the shared libraries
> (Intel or other) without a problem.
>
> My guess is that this is not an OpenMPI problem, but an Intel compiler
> environment glitch. I wonder if your .profile/.tcshrc/.bashrc files
> initialize the Intel compiler environment properly. I.e.,
> "source /share/apps/intel/fce/10.1.018/bin/ifortvars.csh" or similar,
> to get the right Intel environment variables inserted on PATH,
> LD_LIBRARY_PATH, MANPATH, and INTEL_LICENSE_FILE.
>
> Not doing this caused trouble for me in the past. Double or inconsistent
> assignment of LD_LIBRARY_PATH and PATH (say on the ifortvars.csh and on
> the user login files) also caused conflicts.
>
> I am not sure if this needs to be done before you configure and install
> OpenMPI, but doing it after you build OpenMPI may still be OK.
>
> I hope this helps,
> Gus Correa

That does help. I confirmed that what I added needs to be in the environment (LD_LIBRARY_PATH). Must have missed that in the docs. I have now added the appropriate variables to our modules environment.

Seems strange that OpenMPI built without these being set at all. I could also compile test codes with the compilers, just not with mpicc and mpif90.

-Ray Muno
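For bash users, the initialization Gus describes would look roughly like this. The install path is the one quoted in the thread; the explicit-export variant (for a modules setup) is an assumed sketch, and the exact lib subdirectory may differ per site:

```shell
# In a login file: pull in the Intel 10.1 environment before using the
# Open MPI wrappers, so mpif90/mpicc can locate libimf.so.
source /share/apps/intel/fce/10.1.018/bin/ifortvars.sh

# Or set the variables explicitly, e.g. inside an environment-modules
# modulefile (hypothetical paths; verify the lib directory locally):
export PATH=/share/apps/intel/fce/10.1.018/bin:$PATH
export LD_LIBRARY_PATH=/share/apps/intel/fce/10.1.018/lib:${LD_LIBRARY_PATH:-}
```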
Re: [OMPI users] Trouble with OpenMPI and Intel 10.1 compilers
Ray Muno wrote:
> I updated the LD_LIBRARY_PATH to point to the directories that contain
> the installed copies of libimf.so. (this is not something I have not
> had to do for other compiler/OpenMpi combinations)

How about...

(this is not something I have had to do for other compiler/OpenMpi combinations)

-Ray Muno
[OMPI users] Trouble with OpenMPI and Intel 10.1 compilers
We have recently installed the Intel 10.1 compiler suite on our cluster. I built OpenMPI (1.2.7 and 1.2.8) with:

./configure CC=icc CXX=icpc F77=ifort FC=ifort

It configures, builds and installs. However, the MPI compiler drivers (mpicc, mpif90, etc.) fail immediately with an error of the sort:

mpif90: error while loading shared libraries: libimf.so: cannot open shared object file: No such file or directory

I updated the LD_LIBRARY_PATH to point to the directories that contain the installed copies of libimf.so. (this is not something I have not had to do for other compiler/OpenMpi combinations) At that point, the program will compile but I get warnings like:

[muno@titan ~]$ mpif90 test.f
/share/apps/Intel/fce/10.1.018/lib/libimf.so: warning: warning: feupdateenv is not implemented and will always fail

In a google search, I found a reference to this in the OpenMPI lists. When I follow the link, it is a different thread. Searching the OpenMPI lists from the web page does not find any matches. Strange.

I found some references to this at other sites using OpenMPI on clusters, and they said to use -i_dynamic on the compile line. This removes the warning.

Is there something I should be doing at OpenMPI configure time to take care of these issues?

--
Ray Muno
University of Minnesota Aerospace Engineering
Re: [OMPI users] /dev/shm
Jeff Squyres wrote:
> That is odd. Is your user's app crashing or being forcibly killed? The
> ORTE daemon that is silently launched in v1.2 jobs should ensure that
> files under /tmp/openmpi-sessions-@ are removed.

It looks like I see orphaned directories under /tmp/openmpi* as well.

--
Ray Muno
Re: [OMPI users] /dev/shm
Brock Palen wrote:
> on most systems /dev/shm is limited to half the physical ram. Was the
> user someone filling up /dev/shm so there was no space?

The problem is there is a large collection of stale files left in there by the users that have run on that node (Rocks based cluster). I am trying to determine why they are left behind.

--
Ray Muno
University of Minnesota Aerospace Engineering and Mechanics
[OMPI users] /dev/shm
We are running OpenMPI 1.2.7. Now that we have been running for a while, we are getting messages of the sort:

node: Unable to allocate shared memory for intra-node messaging.
node: Delete stale shared memory files in /dev/shm.

MPI process terminated unexpectedly

If the user deletes the stale files, they can run.

--
Ray Muno
University of Minnesota Aerospace Engineering and Mechanics
Re: [OMPI users] Performance: MPICH2 vs OpenMPI
I would be interested in what others have to say about this as well. We have been doing a bit of performance testing since we are deploying a new cluster and it is our first InfiniBand based setup. In our experience so far, OpenMPI is coming out faster than MVAPICH. Comparisons were made with different compilers, PGI and Pathscale. We do not have a running implementation of OpenMPI with SunStudio compilers. Our tests were with actual user codes running on up to 600 processors so far.

Sangamesh B wrote:
> Hi All,
>
> I wanted to switch from mpich2/mvapich2 to OpenMPI, as OpenMPI supports
> both ethernet and infiniband. Before doing that I tested an application
> 'GROMACS' to compare the performance of MPICH2 & OpenMPI. Both have
> been compiled with GNU compilers.
>
> After this benchmark, I came to know that OpenMPI is slower than MPICH2.
>
> This benchmark is run on a AMD dual core, dual opteron processor. Both
> have compiled with default configurations.
>
> The job is run on 2 nodes - 8 cores.
>
> OpenMPI - 25 m 39 s.
> MPICH2  - 15 m 53 s.
>
> Any comments ..?
>
> Thanks,
> Sangamesh

-Ray Muno
Aerospace Engineering