Looking at your configure line: you forgot to include --with-pmi=<path-to-slurm-pmi-lib>. We don't build Slurm's PMI support by default due to GPL licensing issues; you have to point Open MPI at it explicitly.
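A minimal sketch of such a configure line, assuming a Debian-style layout (the prefix and the PMI header/library paths below are hypothetical; substitute wherever your Slurm installation actually put pmi.h and libpmi):

```shell
# Hypothetical paths -- point these at your site's actual Slurm install.
./configure --prefix=/opt/openmpi-4.0.5 \
            --with-slurm \
            --with-pmi=/usr \
            --with-pmi-libdir=/usr/lib/x86_64-linux-gnu
make -j && make install
```

Afterwards, `ompi_info | grep -i pmi` on the rebuilt installation should show the Slurm PMI components in addition to the pmix ones.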
> On Jan 24, 2022, at 6:41 AM, Matthias Leopold via users <users@lists.open-mpi.org> wrote:
>
> Hi,
>
> we have 2 DGX A100 machines and I'm trying to run nccl-tests
> (https://github.com/NVIDIA/nccl-tests) in various ways to understand how
> things work.
>
> I can successfully run nccl-tests on both nodes with Slurm (via srun) when
> built directly on a compute node against Open MPI 4.1.2 coming from an
> NVIDIA deb package.
>
> I can also build nccl-tests in an lmod environment with NVIDIA HPC SDK
> 21.09 with Open MPI 4.0.5. When I run this with Slurm (via srun) I get the
> following message:
>
> [foo:1140698] OPAL ERROR: Error in file
> ../../../../../opal/mca/pmix/pmix3x/pmix3x_client.c at line 112
> --------------------------------------------------------------------------
> > The application appears to have been direct launched using "srun",
> > but OMPI was not built with SLURM's PMI support and therefore cannot
> > execute. There are several options for building PMI support under
> > SLURM, depending upon the SLURM version you are using:
> >
> > version 16.05 or later: you can use SLURM's PMIx support. This
> > requires that you configure and build SLURM --with-pmix.
> >
> > Versions earlier than 16.05: you must use either SLURM's PMI-1 or
> > PMI-2 support. SLURM builds PMI-1 by default, or you can manually
> > install PMI-2. You must then build Open MPI using --with-pmi pointing
> > to the SLURM PMI library location.
> >
> > Please configure as appropriate and try again.
> > --------------------------------------------------------------------------
> > *** An error occurred in MPI_Init
> > *** on a NULL communicator
> > *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> > *** and potentially your MPI job)
>
> When I look at PMI support in both Open MPI packages I don't see a lot of
> difference:
>
> "/usr/mpi/gcc/openmpi-4.1.2a1/bin/ompi_info --parsable | grep -i pmi":
>
> mca:pmix:isolated:version:"mca:2.1.0"
> mca:pmix:isolated:version:"api:2.0.0"
> mca:pmix:isolated:version:"component:4.1.2"
> mca:pmix:flux:version:"mca:2.1.0"
> mca:pmix:flux:version:"api:2.0.0"
> mca:pmix:flux:version:"component:4.1.2"
> mca:pmix:pmix3x:version:"mca:2.1.0"
> mca:pmix:pmix3x:version:"api:2.0.0"
> mca:pmix:pmix3x:version:"component:4.1.2"
> mca:ess:pmi:version:"mca:2.1.0"
> mca:ess:pmi:version:"api:3.0.0"
> mca:ess:pmi:version:"component:4.1.2"
>
> "/msc/sw/hpc-sdk/Linux_x86_64/21.9/comm_libs/mpi/bin/ompi_info --parsable | grep -i pmi":
>
> mca:pmix:isolated:version:"mca:2.1.0"
> mca:pmix:isolated:version:"api:2.0.0"
> mca:pmix:isolated:version:"component:4.0.5"
> mca:pmix:pmix3x:version:"mca:2.1.0"
> mca:pmix:pmix3x:version:"api:2.0.0"
> mca:pmix:pmix3x:version:"component:4.0.5"
> mca:ess:pmi:version:"mca:2.1.0"
> mca:ess:pmi:version:"api:3.0.0"
> mca:ess:pmi:version:"component:4.0.5"
>
> I don't know if that's the right place I'm looking at, but to me it seems
> it's an Open MPI topic, which is why I'm posting here. Please explain
> what's missing in my case.
>
> Slurm is 21.08.5. "MpiDefault" in slurm.conf is "pmix".
> Both Open MPI versions have Slurm support.
>
> thx
> Matthias
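Since your Slurm is built with PMIx support, it is also worth confirming on a compute node which plugins srun actually offers and then selecting one explicitly at launch time. A sketch (the nccl-tests binary name and its flags are examples, not taken from your message):

```shell
# List the MPI plugin types this Slurm installation supports
# (typical output includes entries such as pmi2, pmix, none).
srun --mpi=list

# Launch with the plugin matching how the Open MPI in use was built, e.g.:
srun --mpi=pmix -N 2 ./all_reduce_perf -b 8 -e 128M -f 2 -g 8
```

If the Open MPI build only has PMI-1/PMI-2 support, `--mpi=pmix` will not help; the plugin chosen at srun time and the support compiled into Open MPI must match.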