Hi,

We're trying to use Slurm's built-in Nvidia GPU detection mechanism to avoid 
having to specify GPUs explicitly in slurm.conf and gres.conf. We're running 
Debian 11, and the version of Slurm available for Debian 11 is 20.11. However, 
the version of Slurm in the standard Debian repositories was apparently not 
compiled on a system with the necessary Nvidia library installed, so we 
recompiled Slurm 20.11 from the Debian source package with no modifications.

With AutoDetect=nvml in gres.conf and GresTypes=gpu in slurm.conf, this is what 
we see on a 4-GPU host after restarting slurmd:

[2021-11-29T15:49:58.226] Node reconfigured socket/core boundaries 
SocketsPerBoard=12:2(hw) CoresPerSocket=1:6(hw)
[2021-11-29T15:50:02.397] error: _nvml_get_mem_freqs: Failed to get supported 
memory frequencies for the GPU : Not Supported
[2021-11-29T15:50:02.398] error: _nvml_get_mem_freqs: Failed to get supported 
memory frequencies for the GPU : Not Supported
[2021-11-29T15:50:02.398] error: _nvml_get_mem_freqs: Failed to get supported 
memory frequencies for the GPU : Not Supported
[2021-11-29T15:50:02.399] error: _nvml_get_mem_freqs: Failed to get supported 
memory frequencies for the GPU : Not Supported
[2021-11-29T15:50:02.551] gpu/nvml: _get_system_gpu_list_nvml: 4 GPU system 
device(s) detected
[2021-11-29T15:50:02.551] gres/gpu: _normalize_gres_conf: WARNING: The 
following autodetected GPUs are being ignored:
[2021-11-29T15:50:02.551]     GRES[gpu] Type:geforce_gtx_1080 Count:1 
Cores(12):0-11  Links:0,0,0,-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia3
[2021-11-29T15:50:02.551]     GRES[gpu] Type:geforce_gtx_1080 Count:1 
Cores(12):0-11  Links:0,0,-1,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia2
[2021-11-29T15:50:02.551]     GRES[gpu] Type:geforce_gtx_1080 Count:1 
Cores(12):0-11  Links:0,-1,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia1
[2021-11-29T15:50:02.551]     GRES[gpu] Type:geforce_gtx_1080 Count:1 
Cores(12):0-11  Links:-1,0,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
[2021-11-29T15:50:02.614] slurmd version 20.11.4 started
[2021-11-29T15:50:02.630] slurmd started on Mon, 29 Nov 2021 15:50:02 -0500
[2021-11-29T15:50:02.630] CPUs=12 Boards=1 Sockets=2 Cores=6 Threads=1 
Memory=257840 TmpDisk=3951 Uptime=975072 CPUSpecList=(null) 
FeaturesAvail=(null) FeaturesActive=(null)
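
For reference, the two settings mentioned above amount to the following 
fragments. This is a minimal sketch showing only those two lines; everything 
else in both files (node and partition definitions included) is omitted.

```
# gres.conf (only the line relevant here)
AutoDetect=nvml

# slurm.conf (only the line relevant here; all other settings omitted)
GresTypes=gpu
```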

Running "scontrol show node" for this host displays "Gres=(null)", and any 
attempt to submit a job with --gpus=1 results in "srun: error: Unable to 
allocate resources: Requested node configuration is not available".

Any idea what might be wrong?

Thanks,
~~ bnacar

-- 
Benjamin Nacar
Systems Programmer
Computer Science Department
Brown University
401.863.7621
