If you look at your configure line, you'll see you didn't include 
--with-pmi=<path-to-slurm-pmi-lib>. We don't build the Slurm PMI support by 
default due to GPL licensing issues - you have to point Open MPI at it 
explicitly.
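
For reference, a sketch of what that reconfigure might look like - the 
install prefix and Slurm PMI path below are placeholders, not taken from 
your system. If --with-pmi takes effect, ompi_info should additionally list 
the s1/s2 pmix components, which are absent from both of the outputs you 
posted:

```shell
# Sketch only: <ompi-install-prefix> and <path-to-slurm-pmi-lib> are
# placeholders. The PMI headers and library typically come from your
# distro's slurm-devel / libpmi packages.
./configure --prefix=<ompi-install-prefix> \
            --with-slurm \
            --with-pmi=<path-to-slurm-pmi-lib>
make -j all install

# After rebuilding, check for the Slurm PMI-1/PMI-2 components:
ompi_info --parsable | grep -E 'pmix:(s1|s2)'
```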


> On Jan 24, 2022, at 6:41 AM, Matthias Leopold via users 
> <users@lists.open-mpi.org> wrote:
> 
> Hi,
> 
> we have 2 DGX A100 machines and I'm trying to run nccl-tests 
> (https://github.com/NVIDIA/nccl-tests) in various ways to understand how 
> things work.
> 
> I can successfully run nccl-tests on both nodes with Slurm (via srun) when 
> built directly on a compute node against Open MPI 4.1.2 coming from a NVIDIA 
> deb package.
> 
> I can also build nccl-tests in an lmod environment with NVIDIA HPC SDK 
> 21.09, which includes Open MPI 4.0.5. When I run this with Slurm (via srun) 
> I get the following message:
> 
> [foo:1140698] OPAL ERROR: Error in file 
> ../../../../../opal/mca/pmix/pmix3x/pmix3x_client.c at line 112
> 
> --------------------------------------------------------------------------
> The application appears to have been direct launched using "srun",
> but OMPI was not built with SLURM's PMI support and therefore cannot
> execute. There are several options for building PMI support under
> SLURM, depending upon the SLURM version you are using:
> 
>  version 16.05 or later: you can use SLURM's PMIx support. This
>  requires that you configure and build SLURM --with-pmix.
> 
>  Versions earlier than 16.05: you must use either SLURM's PMI-1 or
>  PMI-2 support. SLURM builds PMI-1 by default, or you can manually
>  install PMI-2. You must then build Open MPI using --with-pmi pointing
>  to the SLURM PMI library location.
> 
> Please configure as appropriate and try again.
> --------------------------------------------------------------------------
> 
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> 
> When I look at PMI support in both Open MPI packages I don't see a lot of 
> difference:
> 
> "/usr/mpi/gcc/openmpi-4.1.2a1/bin/ompi_info --parsable | grep -i pmi":
> 
> mca:pmix:isolated:version:"mca:2.1.0"
> mca:pmix:isolated:version:"api:2.0.0"
> mca:pmix:isolated:version:"component:4.1.2"
> mca:pmix:flux:version:"mca:2.1.0"
> mca:pmix:flux:version:"api:2.0.0"
> mca:pmix:flux:version:"component:4.1.2"
> mca:pmix:pmix3x:version:"mca:2.1.0"
> mca:pmix:pmix3x:version:"api:2.0.0"
> mca:pmix:pmix3x:version:"component:4.1.2"
> mca:ess:pmi:version:"mca:2.1.0"
> mca:ess:pmi:version:"api:3.0.0"
> mca:ess:pmi:version:"component:4.1.2"
> 
> "/msc/sw/hpc-sdk/Linux_x86_64/21.9/comm_libs/mpi/bin/ompi_info --parsable | 
> grep -i pmi":
> 
> mca:pmix:isolated:version:"mca:2.1.0"
> mca:pmix:isolated:version:"api:2.0.0"
> mca:pmix:isolated:version:"component:4.0.5"
> mca:pmix:pmix3x:version:"mca:2.1.0"
> mca:pmix:pmix3x:version:"api:2.0.0"
> mca:pmix:pmix3x:version:"component:4.0.5"
> mca:ess:pmi:version:"mca:2.1.0"
> mca:ess:pmi:version:"api:3.0.0"
> mca:ess:pmi:version:"component:4.0.5"
> 
> I don't know if that's the right place to look, but it seems to me this is 
> an Open MPI topic, which is why I'm posting here. Could you please explain 
> what's missing in my case?
> 
> Slurm is 21.08.5. "MpiDefault" in slurm.conf is "pmix".
> Both Open MPI versions have Slurm support.
> 
> thx
> Matthias

