Hello, is passing the bare --with-pmi flag sufficient, or do I have to write it in the form --with-pmi=<some directory>, pointing to a directory? If so, which directory? I'm slightly confused by the syntax given in the documentation.
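For concreteness, here are the two forms I am weighing; the /usr prefix in the second form is only my guess at where Slurm's PMI library might live on our cluster, not something I have verified:

```
# Form 1: bare flag, letting configure probe its default search paths
./configure --with-slurm --with-pmi

# Form 2: explicit prefix, i.e. the directory whose include/ and lib/
# (or lib64/) subdirectories hold Slurm's pmi2.h and libpmi2.so
# (assumed here to be /usr, as on a typical RPM-based install)
./configure --with-slurm --with-pmi=/usr
```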
[sakshamp.phy20.itbhu@login2]$ srun --mpi=list
srun: MPI types are...
srun: cray_shasta
srun: none
srun: pmi2

If any more info is needed for context, please let me know.

Regards

On Fri, May 19, 2023 at 4:32 PM Saksham Pande 5-Year IDD Physics <saksham.pande.ph...@itbhu.ac.in> wrote:

> Thank you for responding.
> The output of ompi_info regarding configuration is:
>
>   Configure command line: '--build=x86_64-redhat-linux-gnu'
>                           '--host=x86_64-redhat-linux-gnu'
>                           '--program-prefix='
>                           '--disable-dependency-tracking'
>                           '--prefix=/usr/mpi/gcc/openmpi-4.0.2a1'
>                           '--exec-prefix=/usr/mpi/gcc/openmpi-4.0.2a1'
>                           '--bindir=/usr/mpi/gcc/openmpi-4.0.2a1/bin'
>                           '--sbindir=/usr/mpi/gcc/openmpi-4.0.2a1/sbin'
>                           '--sysconfdir=/usr/mpi/gcc/openmpi-4.0.2a1/etc'
>                           '--datadir=/usr/mpi/gcc/openmpi-4.0.2a1/share'
>                           '--includedir=/usr/mpi/gcc/openmpi-4.0.2a1/include'
>                           '--libdir=/usr/mpi/gcc/openmpi-4.0.2a1/lib64'
>                           '--libexecdir=/usr/mpi/gcc/openmpi-4.0.2a1/libexec'
>                           '--localstatedir=/var'
>                           '--sharedstatedir=/var/lib'
>                           '--mandir=/usr/mpi/gcc/openmpi-4.0.2a1/share/man'
>                           '--infodir=/usr/mpi/gcc/openmpi-4.0.2a1/share/info'
>                           '--with-platform=contrib/platform/mellanox/optimized'
>
> BUT these components are also present and contain references to pmi and slurm:
>
>   MCA pmix: flux (MCA v2.1.0, API v2.0.0, Component v4.0.2)
>   MCA pmix: isolated (MCA v2.1.0, API v2.0.0, Component v4.0.2)
>   MCA pmix: pmix3x (MCA v2.1.0, API v2.0.0, Component v4.0.2)
>   MCA ess: pmi (MCA v2.1.0, API v3.0.0, Component v4.0.2)
>   MCA ess: singleton (MCA v2.1.0, API v3.0.0, Component v4.0.2)
>   MCA ess: slurm (MCA v2.1.0, API v3.0.0, Component v4.0.2)
>
> On Fri, May 19, 2023 at 2:48 PM Juergen Salk <juergen.s...@uni-ulm.de> wrote:
>
>> Hi,
>>
>> I am not sure if this is related to GPUs. I rather think the issue has to
>> do with how your OpenMPI has been built.
>>
>> What does the ompi_info command show? Look for "Configure command line"
>> in the output. Does it include the '--with-slurm' and '--with-pmi' flags?
>>
>> To the best of my knowledge, both flags need to be set for OpenMPI to
>> work with srun.
>>
>> Also see:
>>
>> https://www.open-mpi.org/faq/?category=slurm#slurm-direct-srun-mpi-apps
>> https://slurm.schedmd.com/mpi_guide.html#open_mpi
>>
>> Best regards
>> Jürgen
>>
>> * Saksham Pande 5-Year IDD Physics <saksham.pande.ph...@itbhu.ac.in> [230519 07:42]:
>> > Hi everyone,
>> > I am trying to run a simulation software on Slurm using openmpi-4.1.1
>> > and cuda/11.1.
>> > On executing, I get the following error:
>> >
>> > srun --mpi=pmi2 --nodes=1 --ntasks-per-node=5 --partition=gpu \
>> >     --gres=gpu:1 --time=02:00:00 --pty bash -i
>> > ./<executable>
>> >
>> > ```
>> > _____________________________________________________________________________________
>> > |
>> > | Initial checks...
>> > | All good.
>> > |_____________________________________________________________________________________
>> > [gpu008:162305] OPAL ERROR: Not initialized in file pmix3x_client.c at line 112
>> > --------------------------------------------------------------------------
>> > The application appears to have been direct launched using "srun",
>> > but OMPI was not built with SLURM's PMI support and therefore cannot
>> > execute. There are several options for building PMI support under
>> > SLURM, depending upon the SLURM version you are using:
>> >
>> >   version 16.05 or later: you can use SLURM's PMIx support. This
>> >   requires that you configure and build SLURM --with-pmix.
>> >
>> >   Versions earlier than 16.05: you must use either SLURM's PMI-1 or
>> >   PMI-2 support. SLURM builds PMI-1 by default, or you can manually
>> >   install PMI-2. You must then build Open MPI using --with-pmi pointing
>> >   to the SLURM PMI library location.
>> >
>> > Please configure as appropriate and try again.
>> > --------------------------------------------------------------------------
>> > *** An error occurred in MPI_Init
>> > *** on a NULL communicator
>> > *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> > *** and potentially your MPI job)
>> > [gpu008:162305] Local abort before MPI_INIT completed completed
>> > successfully, but am not able to aggregate error messages, and not able
>> > to guarantee that all other processes were killed!
>> > ```
>> >
>> > I am using the following modules: gcc/10.2 openmpi/4.1.1 cuda/11.1.
>> > Running which on mpic++, mpirun, or nvcc returns the module paths only,
>> > which looks correct.
>> > I also changed $PATH and $LD_LIBRARY_PATH based on ldd <executable>,
>> > but I still get the same error.
>> >
>> > [sakshamp.phy20.itbhu@login2 menura]$ srun --mpi=list
>> > srun: MPI types are...
>> > srun: cray_shasta
>> > srun: none
>> > srun: pmi2
>> >
>> > What should I do from here? I have been stuck on this error for 6 days
>> > now. If there is any build difference, I will have to tell the sysadmin.
>> > Since there is an OpenMPI pairing error with Slurm, are there other
>> > errors I could expect between CUDA and OpenMPI?
>> >
>> > Thanks
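P.S. For anyone checking their own installation while following along: the one-liner below shows at a glance both the configure line and the Slurm/PMI-related components; the grep pattern is just a convenience, adjust as needed:

```
ompi_info | grep -i -E 'configure command|slurm|pmi'
```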