You should probably ask them: in the top build they used a platform file,
which likely supplied the missing option. The bottom build does not use that
platform file, so the option was probably missed.
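
If you want to double-check, the platform file ships in the Open MPI source
tree, so you can inspect it for PMI-related settings (a sketch; the path
assumes an unpacked Open MPI source tree):

  grep -i pmi contrib/platform/mellanox/optimized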


> On Jan 24, 2022, at 7:17 AM, Matthias Leopold via users 
> <users@lists.open-mpi.org> wrote:
> 
> To be sure: both packages were provided by NVIDIA (I didn't compile them)
> 
> Am 24.01.22 um 16:13 schrieb Matthias Leopold:
>> Thanks, but I don't see this option in either of the two versions:
>> /usr/mpi/gcc/openmpi-4.1.2a1/bin/ompi_info (works with slurm):
>>   Configure command line: '--build=x86_64-linux-gnu' '--prefix=/usr' 
>> '--includedir=${prefix}/include' '--mandir=${prefix}/share/man' 
>> '--infodir=${prefix}/share/info' '--sysconfdir=/etc' '--localstatedir=/var' 
>> '--disable-silent-rules' '--libexecdir=${prefix}/lib/openmpi' 
>> '--disable-maintainer-mode' '--disable-dependency-tracking' 
>> '--prefix=/usr/mpi/gcc/openmpi-4.1.2a1' 
>> '--with-platform=contrib/platform/mellanox/optimized'
>> lmod ompi (doesn't work with slurm):
>>   Configure command line: 
>> '--prefix=/proj/nv/libraries/Linux_x86_64/dev/openmpi4/205295-dev-clean-1' 
>> 'CC=nvc -nomp' 'CXX=nvc++ -nomp' 'FC=nvfortran -nomp' 'CFLAGS=-O1 -fPIC -c99 
>> -tp p7-64' 'CXXFLAGS=-O1 -fPIC -tp p7-64' 'FCFLAGS=-O1 -fPIC -tp p7-64' 
>> 'LD=ld' '--enable-shared' '--enable-static' '--without-tm' 
>> '--enable-mpi-cxx' '--disable-wrapper-runpath' 
>> '--enable-mpirun-prefix-by-default' '--with-libevent=internal' 
>> '--with-slurm' '--without-libnl' '--enable-mpi1-compatibility' 
>> '--enable-mca-no-build=btl-uct' '--without-verbs' 
>> '--with-cuda=/proj/cuda/11.0/Linux_x86_64' 
>> '--with-ucx=/proj/nv/libraries/Linux_x86_64/dev/openmpi4/205295-dev-clean-1' 
>> Matthias
>> Am 24.01.22 um 15:59 schrieb Ralph Castain via users:
>>> If you look at your configure line, you forgot to include 
>>> --with-pmi=<path-to-slurm-pmi-lib>. We don't build the Slurm PMI support by 
>>> default due to the GPL licensing issues - you have to point at it.
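>>> 
>>> A minimal sketch of what that configure line might look like (paths are
>>> hypothetical; point --with-pmi at wherever Slurm installed pmi.h/pmi2.h
>>> and the libpmi/libpmi2 libraries):
>>> 
>>>   ./configure --prefix=/opt/openmpi-4.0.5 \
>>>       --with-slurm \
>>>       --with-pmi=/usr \
>>>       --with-pmi-libdir=/usr/lib/x86_64-linux-gnu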
>>> 
>>> 
>>>> On Jan 24, 2022, at 6:41 AM, Matthias Leopold via users 
>>>> <users@lists.open-mpi.org> wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> We have two DGX A100 machines, and I'm trying to run nccl-tests 
>>>> (https://github.com/NVIDIA/nccl-tests) in various ways to understand how 
>>>> things work.
>>>> 
>>>> I can successfully run nccl-tests on both nodes with Slurm (via srun) when 
>>>> it is built directly on a compute node against the Open MPI 4.1.2 that 
>>>> comes from an NVIDIA deb package.
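>>>> 
>>>> A typical invocation looks like this (a sketch; task/GPU counts and the
>>>> nccl-tests binary name are assumptions for illustration):
>>>> 
>>>>   srun --mpi=pmix -N 2 --ntasks-per-node=8 --gpus-per-node=8 \
>>>>       ./build/all_reduce_perf -b 8 -e 1G -f 2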
>>>> 
>>>> I can also build nccl-tests in an lmod environment with NVIDIA HPC SDK 
>>>> 21.09, which ships Open MPI 4.0.5. When I run this with Slurm (via srun), 
>>>> I get the following message:
>>>> 
>>>> [foo:1140698] OPAL ERROR: Error in file
>>>> ../../../../../opal/mca/pmix/pmix3x/pmix3x_client.c at line 112
>>>> --------------------------------------------------------------------------
>>>> The application appears to have been direct launched using "srun",
>>>> but OMPI was not built with SLURM's PMI support and therefore cannot
>>>> execute. There are several options for building PMI support under
>>>> SLURM, depending upon the SLURM version you are using:
>>>> 
>>>>   version 16.05 or later: you can use SLURM's PMIx support. This
>>>>   requires that you configure and build SLURM --with-pmix.
>>>> 
>>>>   Versions earlier than 16.05: you must use either SLURM's PMI-1 or
>>>>   PMI-2 support. SLURM builds PMI-1 by default, or you can manually
>>>>   install PMI-2. You must then build Open MPI using --with-pmi pointing
>>>>   to the SLURM PMI library location.
>>>> 
>>>> Please configure as appropriate and try again.
>>>> --------------------------------------------------------------------------
>>>> *** An error occurred in MPI_Init
>>>> *** on a NULL communicator
>>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>> ***    and potentially your MPI job)
>>>> 
>>>> When I look at PMI support in both Open MPI packages, I don't see much 
>>>> difference:
>>>> 
>>>> "/usr/mpi/gcc/openmpi-4.1.2a1/bin/ompi_info --parsable | grep -i pmi":
>>>> 
>>>> mca:pmix:isolated:version:"mca:2.1.0"
>>>> mca:pmix:isolated:version:"api:2.0.0"
>>>> mca:pmix:isolated:version:"component:4.1.2"
>>>> mca:pmix:flux:version:"mca:2.1.0"
>>>> mca:pmix:flux:version:"api:2.0.0"
>>>> mca:pmix:flux:version:"component:4.1.2"
>>>> mca:pmix:pmix3x:version:"mca:2.1.0"
>>>> mca:pmix:pmix3x:version:"api:2.0.0"
>>>> mca:pmix:pmix3x:version:"component:4.1.2"
>>>> mca:ess:pmi:version:"mca:2.1.0"
>>>> mca:ess:pmi:version:"api:3.0.0"
>>>> mca:ess:pmi:version:"component:4.1.2"
>>>> 
>>>> "/msc/sw/hpc-sdk/Linux_x86_64/21.9/comm_libs/mpi/bin/ompi_info --parsable 
>>>> | grep -i pmi":
>>>> 
>>>> mca:pmix:isolated:version:"mca:2.1.0"
>>>> mca:pmix:isolated:version:"api:2.0.0"
>>>> mca:pmix:isolated:version:"component:4.0.5"
>>>> mca:pmix:pmix3x:version:"mca:2.1.0"
>>>> mca:pmix:pmix3x:version:"api:2.0.0"
>>>> mca:pmix:pmix3x:version:"component:4.0.5"
>>>> mca:ess:pmi:version:"mca:2.1.0"
>>>> mca:ess:pmi:version:"api:3.0.0"
>>>> mca:ess:pmi:version:"component:4.0.5"
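>>>> 
>>>> As far as I understand, the Slurm PMI glue would show up as additional
>>>> pmix components (s1/s2; component names assumed from the Open MPI 4.x
>>>> source tree), so a quick check would be:
>>>> 
>>>>   ompi_info --parsable | grep -E 'mca:pmix:s[12]'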
>>>> 
>>>> I don't know whether I'm looking in the right place, but this seems to be 
>>>> an Open MPI topic, which is why I'm posting here. Please explain what's 
>>>> missing in my case.
>>>> 
>>>> Slurm is 21.08.5. "MpiDefault" in slurm.conf is "pmix".
>>>> Both Open MPI versions have Slurm support.
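>>>> 
>>>> For reference, the MPI plugin types this Slurm offers can be listed
>>>> directly (standard Slurm command; output varies by site):
>>>> 
>>>>   srun --mpi=list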
>>>> 
>>>> thx
>>>> Matthias
>>> 
>>> 
> 
> -- 
> Matthias Leopold
> IT Systems & Communications
> Medizinische Universität Wien
> Spitalgasse 23 / BT 88 / Ebene 00
> A-1090 Wien
> Tel: +43 1 40160-21241
> Fax: +43 1 40160-921200
