You should probably ask them - I see in the top one that they used a platform file, which likely had the missing option in it. The bottom one does not use that platform file, so it was probably missed.
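For anyone finding this thread later: the fix Ralph describes below is to rebuild Open MPI pointing `--with-pmi` at the Slurm installation. A minimal sketch of such a configure invocation follows; the prefixes (`/opt/slurm`, `/usr/local/openmpi`) are placeholders, not the paths from this thread, so substitute your site's actual install locations:

```shell
# Sketch only: rebuild Open MPI with Slurm's PMI support enabled.
# /opt/slurm and /usr/local/openmpi are assumed placeholder prefixes --
# point --with-pmi at wherever your site installed Slurm's PMI
# headers and libraries (the directory containing include/slurm and lib).
./configure --prefix=/usr/local/openmpi \
    --with-slurm \
    --with-pmi=/opt/slurm
make -j"$(nproc)" && make install
```

Note that Open MPI does not build Slurm PMI support by default (for the GPL-licensing reason given below), which is why the flag must be passed explicitly.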
> On Jan 24, 2022, at 7:17 AM, Matthias Leopold via users <users@lists.open-mpi.org> wrote:
>
> To be sure: both packages were provided by NVIDIA (I didn't compile them)
>
> On 24.01.22 at 16:13, Matthias Leopold wrote:
>> Thx, but I don't see this option in either of the two versions:
>>
>> /usr/mpi/gcc/openmpi-4.1.2a1/bin/ompi_info (works with slurm):
>>
>> Configure command line: '--build=x86_64-linux-gnu' '--prefix=/usr'
>> '--includedir=${prefix}/include' '--mandir=${prefix}/share/man'
>> '--infodir=${prefix}/share/info' '--sysconfdir=/etc' '--localstatedir=/var'
>> '--disable-silent-rules' '--libexecdir=${prefix}/lib/openmpi'
>> '--disable-maintainer-mode' '--disable-dependency-tracking'
>> '--prefix=/usr/mpi/gcc/openmpi-4.1.2a1'
>> '--with-platform=contrib/platform/mellanox/optimized'
>>
>> lmod ompi (doesn't work with slurm):
>>
>> Configure command line:
>> '--prefix=/proj/nv/libraries/Linux_x86_64/dev/openmpi4/205295-dev-clean-1'
>> 'CC=nvc -nomp' 'CXX=nvc++ -nomp' 'FC=nvfortran -nomp'
>> 'CFLAGS=-O1 -fPIC -c99 -tp p7-64' 'CXXFLAGS=-O1 -fPIC -tp p7-64'
>> 'FCFLAGS=-O1 -fPIC -tp p7-64' 'LD=ld' '--enable-shared' '--enable-static'
>> '--without-tm' '--enable-mpi-cxx' '--disable-wrapper-runpath'
>> '--enable-mpirun-prefix-by-default' '--with-libevent=internal'
>> '--with-slurm' '--without-libnl' '--enable-mpi1-compatibility'
>> '--enable-mca-no-build=btl-uct' '--without-verbs'
>> '--with-cuda=/proj/cuda/11.0/Linux_x86_64'
>> '--with-ucx=/proj/nv/libraries/Linux_x86_64/dev/openmpi4/205295-dev-clean-1'
>>
>> Matthias
>>
>> On 24.01.22 at 15:59, Ralph Castain via users wrote:
>>> If you look at your configure line, you forgot to include
>>> --with-pmi=<path-to-slurm-pmi-lib>. We don't build the Slurm PMI support by
>>> default due to the GPL licensing issues - you have to point at it.
>>>
>>>> On Jan 24, 2022, at 6:41 AM, Matthias Leopold via users <users@lists.open-mpi.org> wrote:
>>>>
>>>> Hi,
>>>>
>>>> we have 2 DGX A100 machines and I'm trying to run nccl-tests
>>>> (https://github.com/NVIDIA/nccl-tests) in various ways to understand
>>>> how things work.
>>>>
>>>> I can successfully run nccl-tests on both nodes with Slurm (via srun)
>>>> when built directly on a compute node against Open MPI 4.1.2 coming
>>>> from an NVIDIA deb package.
>>>>
>>>> I can also build nccl-tests in an lmod environment with NVIDIA HPC SDK
>>>> 21.09 with Open MPI 4.0.5. When I run this with Slurm (via srun) I get
>>>> the following message:
>>>>
>>>> [foo:1140698] OPAL ERROR: Error in file
>>>> ../../../../../opal/mca/pmix/pmix3x/pmix3x_client.c at line 112
>>>> --------------------------------------------------------------------------
>>>> The application appears to have been direct launched using "srun",
>>>> but OMPI was not built with SLURM's PMI support and therefore cannot
>>>> execute. There are several options for building PMI support under
>>>> SLURM, depending upon the SLURM version you are using:
>>>>
>>>> version 16.05 or later: you can use SLURM's PMIx support. This
>>>> requires that you configure and build SLURM --with-pmix.
>>>>
>>>> Versions earlier than 16.05: you must use either SLURM's PMI-1 or
>>>> PMI-2 support. SLURM builds PMI-1 by default, or you can manually
>>>> install PMI-2. You must then build Open MPI using --with-pmi pointing
>>>> to the SLURM PMI library location.
>>>>
>>>> Please configure as appropriate and try again.
>>>> --------------------------------------------------------------------------
>>>> *** An error occurred in MPI_Init
>>>> *** on a NULL communicator
>>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>> *** and potentially your MPI job)
>>>>
>>>> When I look at PMI support in both Open MPI packages I don't see much
>>>> difference:
>>>>
>>>> "/usr/mpi/gcc/openmpi-4.1.2a1/bin/ompi_info --parsable | grep -i pmi":
>>>>
>>>> mca:pmix:isolated:version:"mca:2.1.0"
>>>> mca:pmix:isolated:version:"api:2.0.0"
>>>> mca:pmix:isolated:version:"component:4.1.2"
>>>> mca:pmix:flux:version:"mca:2.1.0"
>>>> mca:pmix:flux:version:"api:2.0.0"
>>>> mca:pmix:flux:version:"component:4.1.2"
>>>> mca:pmix:pmix3x:version:"mca:2.1.0"
>>>> mca:pmix:pmix3x:version:"api:2.0.0"
>>>> mca:pmix:pmix3x:version:"component:4.1.2"
>>>> mca:ess:pmi:version:"mca:2.1.0"
>>>> mca:ess:pmi:version:"api:3.0.0"
>>>> mca:ess:pmi:version:"component:4.1.2"
>>>>
>>>> "/msc/sw/hpc-sdk/Linux_x86_64/21.9/comm_libs/mpi/bin/ompi_info --parsable | grep -i pmi":
>>>>
>>>> mca:pmix:isolated:version:"mca:2.1.0"
>>>> mca:pmix:isolated:version:"api:2.0.0"
>>>> mca:pmix:isolated:version:"component:4.0.5"
>>>> mca:pmix:pmix3x:version:"mca:2.1.0"
>>>> mca:pmix:pmix3x:version:"api:2.0.0"
>>>> mca:pmix:pmix3x:version:"component:4.0.5"
>>>> mca:ess:pmi:version:"mca:2.1.0"
>>>> mca:ess:pmi:version:"api:3.0.0"
>>>> mca:ess:pmi:version:"component:4.0.5"
>>>>
>>>> I don't know if that's the right place to look, but to me it seems to
>>>> be an Open MPI topic, which is why I'm posting here. Please explain
>>>> what's missing in my case.
>>>>
>>>> Slurm is 21.08.5. "MpiDefault" in slurm.conf is "pmix".
>>>> Both Open MPI versions have Slurm support.
>>>>
>>>> thx
>>>> Matthias
>
> --
> Matthias Leopold
> IT Systems & Communications
> Medizinische Universität Wien
> Spitalgasse 23 / BT 88 / Ebene 00
> A-1090 Wien
> Tel: +43 1 40160-21241
> Fax: +43 1 40160-921200
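[Editor's note] Two quick diagnostics are often useful when debugging this class of srun/PMI mismatch; this is a sketch assuming `srun` and the relevant `ompi_info` are on your PATH, and the output depends entirely on the local installation:

```shell
# Ask Slurm which MPI/PMI plugin types it can offer to direct-launched jobs
# (with MpiDefault=pmix, "pmix" should appear in this list)
srun --mpi=list

# Ask a given Open MPI build which pmix/pmi components it contains,
# as the poster did above
ompi_info --parsable | grep -i pmi
```

If `srun --mpi=list` shows `pmix` but the Open MPI build only carries its internal `pmix3x` component without external PMI/PMIx support wired to Slurm, direct launch fails with exactly the MPI_Init error quoted above.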