Hello, is passing the bare --with-pmi flag sufficient, or do I have to write it in the form --with-pmi=<some directory>, pointing to a directory? If so, which directory? I'm slightly confused by the syntax given in the documentation.
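For concreteness, here are the two forms I am weighing; the /usr prefix in the second form is only my guess at where Slurm's PMI library might live on our cluster, not something I have verified:

```
# Form 1: bare flag, letting configure probe its default search paths
./configure --with-slurm --with-pmi

# Form 2: explicit prefix, i.e. the directory whose include/ and lib/
# (or lib64/) subdirectories hold Slurm's pmi2.h and libpmi2.so
# (assumed here to be /usr, as on a typical RPM-based install)
./configure --with-slurm --with-pmi=/usr
```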
[sakshamp.phy20.itbhu@login2]$ srun --mpi=list
srun: MPI types are...
srun: cray_shasta
srun: none
srun: pmi2

If any more info is needed for context, please let me know.

Regards

On Fri, May 19, 2023 at 4:32 PM Saksham Pande 5-Year IDD Physics <saksham.pande.ph...@itbhu.ac.in> wrote:

> Thank you for responding.
> The output of ompi_info regarding configuration is:
>
>   Configure command line: '--build=x86_64-redhat-linux-gnu'
>                           '--host=x86_64-redhat-linux-gnu'
>                           '--program-prefix='
>                           '--disable-dependency-tracking'
>                           '--prefix=/usr/mpi/gcc/openmpi-4.0.2a1'
>                           '--exec-prefix=/usr/mpi/gcc/openmpi-4.0.2a1'
>                           '--bindir=/usr/mpi/gcc/openmpi-4.0.2a1/bin'
>                           '--sbindir=/usr/mpi/gcc/openmpi-4.0.2a1/sbin'
>                           '--sysconfdir=/usr/mpi/gcc/openmpi-4.0.2a1/etc'
>                           '--datadir=/usr/mpi/gcc/openmpi-4.0.2a1/share'
>                           '--includedir=/usr/mpi/gcc/openmpi-4.0.2a1/include'
>                           '--libdir=/usr/mpi/gcc/openmpi-4.0.2a1/lib64'
>                           '--libexecdir=/usr/mpi/gcc/openmpi-4.0.2a1/libexec'
>                           '--localstatedir=/var'
>                           '--sharedstatedir=/var/lib'
>                           '--mandir=/usr/mpi/gcc/openmpi-4.0.2a1/share/man'
>                           '--infodir=/usr/mpi/gcc/openmpi-4.0.2a1/share/info'
>                           '--with-platform=contrib/platform/mellanox/optimized'
>
> BUT these components are also present and contain references to pmi and slurm:
>
>   MCA pmix: flux (MCA v2.1.0, API v2.0.0, Component v4.0.2)
>   MCA pmix: isolated (MCA v2.1.0, API v2.0.0, Component v4.0.2)
>   MCA pmix: pmix3x (MCA v2.1.0, API v2.0.0, Component v4.0.2)
>   MCA ess: pmi (MCA v2.1.0, API v3.0.0, Component v4.0.2)
>   MCA ess: singleton (MCA v2.1.0, API v3.0.0, Component v4.0.2)
>   MCA ess: slurm (MCA v2.1.0, API v3.0.0, Component v4.0.2)
>
> On Fri, May 19, 2023 at 2:48 PM Juergen Salk <juergen.s...@uni-ulm.de> wrote:
>
>> Hi,
>>
>> I am not sure if this is related to GPUs. I rather think the issue has to
>> do with how your OpenMPI has been built.
>>
>> What does the ompi_info command show? Look for "Configure command line"
>> in the output. Does it include the '--with-slurm' and '--with-pmi' flags?
>>
>> To the best of my knowledge, both flags need to be set for OpenMPI to
>> work with srun.
>>
>> Also see:
>>
>> https://www.open-mpi.org/faq/?category=slurm#slurm-direct-srun-mpi-apps
>> https://slurm.schedmd.com/mpi_guide.html#open_mpi
>>
>> Best regards
>> Jürgen
>>
>> * Saksham Pande 5-Year IDD Physics <saksham.pande.ph...@itbhu.ac.in> [230519 07:42]:
>> > Hi everyone,
>> > I am trying to run a simulation software on Slurm using openmpi-4.1.1
>> > and cuda/11.1.
>> > On executing, I get the following error:
>> >
>> > srun --mpi=pmi2 --nodes=1 --ntasks-per-node=5 --partition=gpu \
>> >     --gres=gpu:1 --time=02:00:00 --pty bash -i
>> > ./<executable>
>> >
>> > ```
>> > _____________________________________________________________________________________
>> > |
>> > | Initial checks...
>> > | All good.
>> > |_____________________________________________________________________________________
>> > [gpu008:162305] OPAL ERROR: Not initialized in file pmix3x_client.c at line 112
>> > --------------------------------------------------------------------------
>> > The application appears to have been direct launched using "srun",
>> > but OMPI was not built with SLURM's PMI support and therefore cannot
>> > execute. There are several options for building PMI support under
>> > SLURM, depending upon the SLURM version you are using:
>> >
>> >   version 16.05 or later: you can use SLURM's PMIx support. This
>> >   requires that you configure and build SLURM --with-pmix.
>> >
>> >   Versions earlier than 16.05: you must use either SLURM's PMI-1 or
>> >   PMI-2 support. SLURM builds PMI-1 by default, or you can manually
>> >   install PMI-2. You must then build Open MPI using --with-pmi pointing
>> >   to the SLURM PMI library location.
>> >
>> > Please configure as appropriate and try again.
>> > --------------------------------------------------------------------------
>> > *** An error occurred in MPI_Init
>> > *** on a NULL communicator
>> > *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> > *** and potentially your MPI job)
>> > [gpu008:162305] Local abort before MPI_INIT completed completed
>> > successfully, but am not able to aggregate error messages, and not able
>> > to guarantee that all other processes were killed!
>> > ```
>> >
>> > I am using the following modules: gcc/10.2 openmpi/4.1.1 cuda/11.1.
>> > Running which on mpic++, mpirun, or nvcc returns the module paths only,
>> > which looks correct.
>> > I also changed $PATH and $LD_LIBRARY_PATH based on ldd <executable>,
>> > but I still get the same error.
>> >
>> > [sakshamp.phy20.itbhu@login2 menura]$ srun --mpi=list
>> > srun: MPI types are...
>> > srun: cray_shasta
>> > srun: none
>> > srun: pmi2
>> >
>> > What should I do from here? I have been stuck on this error for 6 days
>> > now. If there is any build difference, I will have to tell the sysadmin.
>> > Since there is an OpenMPI pairing error with Slurm, are there other
>> > errors I could expect between CUDA and OpenMPI?
>> >
>> > Thanks
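P.S. For anyone checking their own installation while following along: the one-liner below shows at a glance both the configure line and the Slurm/PMI-related components; the grep pattern is just a convenience, adjust as needed:

```
ompi_info | grep -i -E 'configure command|slurm|pmi'
```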