Assuming you have the CUDA drivers installed correctly (nvidia-smi works, for
instance), you should create a gres.conf with just this line:
> AutoDetect=nvml
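For reference, a minimal sketch of how that pairs with slurm.conf; the node
name and GPU count below are hypothetical and need to match your actual
hardware:

    # gres.conf on the GPU node: let slurmd query NVML for the devices
    AutoDetect=nvml

    # slurm.conf (hypothetical node entry; adjust name and count to your site)
    GresTypes=gpu
    NodeName=gpu-node01 Gres=gpu:4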
If that doesn’t automagically begin working, you can increase the verbosity of
slurmd with
> SlurmdDebug=debug2
It should then print a
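As an aside, if you would rather watch it live than dig through the logs, you
can also stop the service and run slurmd in the foreground with extra
verbosity (a sketch; the service name and paths may differ on your system):

    # run slurmd in the foreground with verbose output to watch GRES detection
    $ sudo systemctl stop slurmd
    $ sudo slurmd -D -vvv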
Hi Will,
I appreciate your corroboration.
After we upgraded to 23.02.$latest, the issue seemed easier to reproduce than
before.
However, the issue appears to have subsided, and the only change I can
potentially attribute it to was after turning on
> SlurmctldParameters=rl_enable
in slurm.conf.
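For anyone who wants to try the same thing, this is roughly the change and how
I confirmed it took effect (a sketch; the rl_* parameters require a 23.02 or
newer controller):

    # slurm.conf on the controller: enable RPC rate limiting
    SlurmctldParameters=rl_enable

    # apply and confirm
    $ scontrol reconfigure
    $ scontrol show config | grep -i SlurmctldParameters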
I seem to have run into an edge case where I’m able to oversubscribe a specific
subset of GPUs on one host in particular.
Slurm 22.05.8
Ubuntu 20.04
cgroups v1 (ProctrackType=proctrack/cgroup)
It seems to be partly a corner case with a couple of caveats.
This host has 2 different GPU types in
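To give a rough picture of the layout, the gres.conf on that host looks
something like this sketch (the type names and device paths are illustrative,
not the exact ones in use):

    # gres.conf sketch: one node exposing two different GPU types
    NodeName=gpu-node01 Name=gpu Type=a100 File=/dev/nvidia[0-3]
    NodeName=gpu-node01 Name=gpu Type=v100 File=/dev/nvidia[4-7]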
I have a bash script that grabs current statistics from sinfo and ships them
into a time series database that feeds our Grafana dashboards.
We recently began using shards with our GPUs, and I’m seeing some unexpected
behavior in the data reported by sinfo.
> $ sinfo -h -O "NodeHost:5 ,GresUsed:100
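For context, the collection side is essentially a loop along these lines (a
simplified sketch; the field widths and output format are illustrative):

    #!/usr/bin/env bash
    # Sketch: pull per-node GRES usage from sinfo, one record per node
    sinfo -h -N -O "NodeHost:25,GresUsed:100" | while read -r node gres_used; do
        # gres_used is sinfo's GresUsed column, e.g. gpu:2(IDX:0-1),shard:...
        printf '%s gres_used="%s"\n' "$node" "$gres_used"
    done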