Re: [slurm-users] Building Slurm RPMs with NVIDIA GPU support?

Paul Edmon Tue, 26 Jan 2021 13:05:45 -0800

That is correct. I think NVML has some additional features, but in terms of actually scheduling them what you have should work. They will just be treated as normal gres resources.


-Paul Edmon-


On 1/26/2021 3:55 PM, Ole Holm Nielsen wrote:

On 26-01-2021 21:36, Paul Edmon wrote:
You can include GPUs as gres in Slurm without compiling specifically against NVML. You only really need to do that if you want to use the autodetection features that have been built into Slurm. We don't really use any of those features at our site; we only started building against NVML to future-proof ourselves for when/if those features become relevant to us.
Thanks for this clarification about not actually *requiring* the NVIDIA NVML library in the Slurm build!
Now I'm seeing this description in https://slurm.schedmd.com/gres.html about automatic GPU configuration by Slurm:
If AutoDetect=nvml is set in gres.conf, and the NVIDIA Management Library (NVML) is installed on the node and was found during Slurm configuration, configuration details will automatically be filled in for any system-detected NVIDIA GPU. This removes the need to explicitly configure GPUs in gres.conf, though the Gres= line in slurm.conf is still required in order to tell slurmctld how many GRES to expect.
I have defined our GPUs manually in gres.conf with File=/dev/nvidia? lines, so it would seem that this obviates the need for NVML. Is this the correct conclusion?
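
For readers following along, the manual setup described above can be sketched as a small shell helper that prints a gres.conf line and the matching slurm.conf lines for a node. This is only an illustration under assumptions: the function name print_manual_gres and the node name gpu01 are hypothetical, the GPUs are assumed to be numbered from /dev/nvidia0 upward, and a real NodeName line in slurm.conf would carry the usual CPU and memory parameters as well.

```shell
#!/bin/sh
# Hypothetical helper: print a manual (non-NVML) GPU definition for one node.
# $1 = node name (illustrative), $2 = number of GPUs on that node.
print_manual_gres() {
  node="$1"
  count="$2"
  if [ "$count" -eq 1 ]; then
    files="/dev/nvidia0"
  else
    # Device files are assumed to be /dev/nvidia0 .. /dev/nvidia(count-1).
    files="/dev/nvidia[0-$((count - 1))]"
  fi
  echo "# gres.conf"
  echo "NodeName=${node} Name=gpu File=${files}"
  echo "# slurm.conf (add to the node's existing NodeName line)"
  echo "GresTypes=gpu"
  echo "NodeName=${node} Gres=gpu:${count}"
}

# Example call (not run automatically):
#   print_manual_gres gpu01 4
```

With AutoDetect=nvml the File= details in gres.conf would be filled in by Slurm instead, but the Gres= count in slurm.conf is needed either way, as the documentation quoted above says.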
/Ole
To me at least it would be nicer if there was a less hacky way of getting it to do that. Arguably Slurm should dynamically link against the libs it needs or not depending on the node. We hit this issue with Lustre/IB as well, where you have to roll a separate Slurm for each type of node you have if you want these, which is hardly ideal.
-Paul Edmon-

On 1/26/2021 3:24 PM, Robert Kudyba wrote:
You all might be interested in a patch to the SPEC file, to not make the slurm RPMs depend on libnvidia-ml.so, even if it's been enabled at configure time. See https://bugs.schedmd.com/show_bug.cgi?id=7919#c3
On Tue, Jan 26, 2021 at 3:17 PM Paul Raines <rai...@nmr.mgh.harvard.edu> wrote:
    You should check your jobs that allocated GPUs and make sure
    CUDA_VISIBLE_DEVICES is being set in the environment. If it is not,
    that is a sign your GPU support is not really there and SLURM is just
    doing "generic" resource assignment.
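
A minimal sketch of that check, meant to be pasted into a job script that requested a GPU (for example one submitted with --gres=gpu:1). The function name check_gpu_env is hypothetical; it only inspects the job's environment and does not talk to Slurm.

```shell
#!/bin/sh
# Hypothetical helper: report whether the job environment has
# CUDA_VISIBLE_DEVICES set. Returns non-zero when it is unset or empty.
check_gpu_env() {
  if [ -n "${CUDA_VISIBLE_DEVICES:-}" ]; then
    echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"
  else
    echo "CUDA_VISIBLE_DEVICES is not set" >&2
    return 1
  fi
}

# Inside the job script one would simply call:
#   check_gpu_env || exit 1
```

A quick interactive equivalent would be to run env through srun with a GPU request and grep for the variable.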

    I have both GPU and non-GPU nodes. I build the SLURM RPMs twice: once
    on a non-GPU node, and I use those RPMs to install on the non-GPU
    nodes. Then I build again on the GPU node where CUDA is installed via
    the NVIDIA CUDA YUM repo RPMs, so the NVML lib is at
    /lib64/libnvidia-ml.so.1 (from rpm
    nvidia-driver-NVML-455.45.01-1.el8.x86_64), and no special mods to
    the default RPM SPEC are needed. I just run

       rpmbuild --tb slurm-20.11.3.tar.bz2

    You can run 'rpm -qlp slurm-20.11.3-1.el8.x86_64.rpm | grep nvml'
    and see that /usr/lib64/slurm/gpu_nvml.so only exists on the one
    built on the GPU node.
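
If there are several RPMs to compare, the same check can be wrapped in a small filter. This is a sketch only: the function name has_nvml_plugin is hypothetical, and it just reads a file list on standard input, such as the output of rpm -qlp.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a file list read from stdin contains
# the NVML GPU plugin (a path ending in /gpu_nvml.so).
has_nvml_plugin() {
  grep -q '/gpu_nvml\.so$'
}

# Example loop over the RPMs in the current directory (not run automatically):
#   for rpm in slurm-*.rpm; do
#     if rpm -qlp "$rpm" | has_nvml_plugin; then
#       echo "$rpm: gpu_nvml.so present"
#     else
#       echo "$rpm: gpu_nvml.so absent"
#     fi
#   done
```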

    -- Paul Raines
    (http://help.nmr.mgh.harvard.edu)



    On Tue, 26 Jan 2021 2:29pm, Ole Holm Nielsen wrote:

    > In another thread, On 26-01-2021 17:44, Prentice Bisbal wrote:
    >>  Personally, I think it's good that Slurm RPMs are now available through
    >>  EPEL, although I won't be able to use them, and I'm sure many people on
    >>  the list won't be able to either, since licensing issues prevent them from
    >>  providing support for NVIDIA drivers, so those of us with GPUs on our
    >>  clusters will still have to compile Slurm from source to include NVIDIA
    >>  GPU support.
    >
    > We're running Slurm 20.02.6 and recently added some NVIDIA GPU nodes.
    > The Slurm GPU documentation seems to be
    > https://slurm.schedmd.com/gres.html
    >
    > We don't seem to have any problems scheduling jobs on GPUs, even though our
    > Slurm RPM build host doesn't have any NVIDIA software installed, as shown by
    > the command:
    > $ ldconfig -p | grep libnvidia-ml
    >
    > I'm curious about Prentice's statement about needing NVIDIA libraries to be
    > installed when building Slurm RPMs, and I read the discussion in bug 9525,
    > https://bugs.schedmd.com/show_bug.cgi?id=9525
    > from which it seems that the problem was fixed in 20.02.6 and 20.11.
    >
    > Question: Is there anything special that needs to be done when building Slurm
    > RPMs with NVIDIA GPU support?
    >
    > Thanks,
    > Ole
