[slurm-users] How to fix a node in state=inval?
I am building a cluster exclusively with dynamic nodes, which all boot up over the network from the same system image (Debian 12); so far there is just one physical node, as well as a VM that I used for the initial tests:

    # sinfo
    PARTITION AVAIL  TIMELIMIT  NODES  STATE  NODELIST
    all*      up      infinite      1  inval  gpu18c04d858b05
    all*      up      infinite      1  down*  node080027aea419

When I compare what the master node thinks of gpu18c04d858b05 with what the node itself reports, they seem to agree.

On gpu18c04d858b05:

    root@gpu18c04d858b05:~# slurmd -C
    NodeName=gpu18c04d858b05 CPUs=16 Boards=1 SocketsPerBoard=1 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=64240
    UpTime=0-18:04:06

And on the master:

    # scontrol show node gpu18c04d858b05
    NodeName=gpu18c04d858b05 Arch=x86_64 CoresPerSocket=8
       CPUAlloc=0 CPUEfctv=16 CPUTot=16 CPULoad=0.16
       AvailableFeatures=(null)
       ActiveFeatures=(null)
       Gres=gpu:geforce:1
       NodeAddr=192.168.50.68 NodeHostName=gpu18c04d858b05 Version=23.02.3
       OS=Linux 6.1.0-9-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.27-1 (2023-05-08)
       RealMemory=64240 AllocMem=0 FreeMem=63739 Sockets=1 Boards=1
       State=DOWN+DRAIN+DYNAMIC_NORM+INVALID_REG ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
       Partitions=all
       BootTime=2023-08-31T15:25:55 SlurmdStartTime=2023-08-31T15:26:20
       LastBusyTime=2023-08-31T10:24:01 ResumeAfterTime=None
       CfgTRES=cpu=16,mem=64240M,billing=16
       AllocTRES=
       CapWatts=n/a
       CurrentWatts=0 AveWatts=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
       Reason=hang [root@2023-08-31T16:38:27]

I tried to fix it with:

    # scontrol update nodename=gpu18c04d858b05 state=down reason=hang
    # scontrol update nodename=gpu18c04d858b05 state=resume

However, that made no difference; what is the next step in troubleshooting this issue?
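A hedged troubleshooting sketch (assumptions, not facts from this thread): the INVALID_REG flag usually means the node's registration message disagreed with the node definition slurmctld holds, and slurmctld logs which field it rejected. Note that "slurmd -C" prints only the hardware topology, so the Gres=gpu:geforce:1 the controller expects cannot be checked from the output above. Log locations depend on SlurmctldLogFile:

    # On the master: look for the registration rejection for this node
    grep -i 'gpu18c04d858b05' /var/log/slurm/slurmctld.log | grep -iE 'invalid|reject|mismatch'

    # On the node: run slurmd in the foreground with verbose logging
    # and watch the registration attempt it sends to the controller
    slurmd -D -vvv

Once the underlying mismatch is fixed, restarting slurmd so the node re-registers is likely also needed; "state=resume" by itself does not trigger a new registration.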
Re: [slurm-users] configure script can't find nvml.h or libnvidia-ml.so
7faee7817000)

Also, slurmd.log doesn't list the GPU even when I set SlurmdDebug=debug2 in the config file (but I do see other debug2 entries). I've set 'AutoDetect=nvml' in gres.conf and 'GresTypes=gpu' in slurm.conf; shouldn't it work by now? Or is my build of slurm still not good?

On 19/07/2023 12:26, Timo Rothenpieler wrote:
> On 19/07/2023 11:47, Jan Andersen wrote:
>> I'm trying to build slurm with nvml support, but configure doesn't find it:
>>
>> root@zorn:~/slurm-23.02.3# ./configure --with-nvml
>> ...
>> checking for hwloc installation... /usr
>> checking for nvml.h... no
>> checking for nvmlInit in -lnvidia-ml... yes
>> configure: error: unable to locate libnvidia-ml.so and/or nvml.h
>>
>> But:
>>
>> root@zorn:~/slurm-23.02.3# find / -xdev -name nvml.h
>> /usr/include/hwloc/nvml.h
>
> It's not looking for the hwloc header, but for the nvidia one.
> If you have your CUDA SDK installed in, for example, /opt/cuda, you have to point it there: --with-nvml=/opt/cuda
>
>> root@zorn:~/slurm-23.02.3# find / -xdev -name libnvidia-ml.so
>> /usr/lib32/libnvidia-ml.so
>> /usr/lib/x86_64-linux-gnu/libnvidia-ml.so
>>
>> I tried to figure out how to tell configure where to find them, but the script is a bit eye-watering; how should I do it?
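For reference, a minimal sketch of the configuration the message above describes (the node name and GPU count are illustrative, not from the thread):

    # gres.conf
    AutoDetect=nvml

    # slurm.conf (relevant lines only)
    GresTypes=gpu
    NodeName=zorn Gres=gpu:1 ...

AutoDetect=nvml only works if slurmd itself was built against NVML; if configure never found nvml.h (as in the original question below), the NVML GRES plugin is not compiled in, and no GPU will appear in slurmd.log no matter the debug level, which matches the symptom described.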
Re: [slurm-users] configure script can't find nvml.h or libnvidia-ml.so
Hmm, OK - but that is the only nvml.h I can find, as shown by the find command. I downloaded the official NVIDIA-Linux-x86_64-535.54.03.run and ran it successfully; do I need to install something else besides? A Google search for 'CUDA SDK' leads directly to NVIDIA's page: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html

On 19/07/2023 12:26, Timo Rothenpieler wrote:
> On 19/07/2023 11:47, Jan Andersen wrote:
>> I'm trying to build slurm with nvml support, but configure doesn't find it:
>>
>> root@zorn:~/slurm-23.02.3# ./configure --with-nvml
>> ...
>> checking for hwloc installation... /usr
>> checking for nvml.h... no
>> checking for nvmlInit in -lnvidia-ml... yes
>> configure: error: unable to locate libnvidia-ml.so and/or nvml.h
>>
>> But:
>>
>> root@zorn:~/slurm-23.02.3# find / -xdev -name nvml.h
>> /usr/include/hwloc/nvml.h
>
> It's not looking for the hwloc header, but for the nvidia one.
> If you have your CUDA SDK installed in, for example, /opt/cuda, you have to point it there: --with-nvml=/opt/cuda
>
>> root@zorn:~/slurm-23.02.3# find / -xdev -name libnvidia-ml.so
>> /usr/lib32/libnvidia-ml.so
>> /usr/lib/x86_64-linux-gnu/libnvidia-ml.so
>>
>> I tried to figure out how to tell configure where to find them, but the script is a bit eye-watering; how should I do it?
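A hedged note (an assumption, not confirmed in this thread): the NVIDIA-Linux-*.run driver installer ships the runtime library libnvidia-ml.so but not the NVML development header; nvml.h normally comes with the CUDA toolkit (or with a development package such as Debian/Ubuntu's libnvidia-ml-dev, where available). After installing the toolkit per the guide linked above, something like:

    # The toolkit typically lands under /usr/local/cuda (path may vary)
    find /usr/local/cuda -name nvml.h
    ./configure --with-nvml=/usr/local/cuda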
[slurm-users] configure script can't find nvml.h or libnvidia-ml.so
I'm trying to build slurm with nvml support, but configure doesn't find it:

    root@zorn:~/slurm-23.02.3# ./configure --with-nvml
    ...
    checking for hwloc installation... /usr
    checking for nvml.h... no
    checking for nvmlInit in -lnvidia-ml... yes
    configure: error: unable to locate libnvidia-ml.so and/or nvml.h

But:

    root@zorn:~/slurm-23.02.3# find / -xdev -name nvml.h
    /usr/include/hwloc/nvml.h
    root@zorn:~/slurm-23.02.3# find / -xdev -name libnvidia-ml.so
    /usr/lib32/libnvidia-ml.so
    /usr/lib/x86_64-linux-gnu/libnvidia-ml.so

I tried to figure out how to tell configure where to find them, but the script is a bit eye-watering; how should I do it?
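One general way to see what the failed probe actually looked for (standard autoconf behavior, not specific to Slurm): every header check is recorded in config.log, including the compiler invocation and include paths used; from there, pass the directory tree that contains NVIDIA's nvml.h via --with-nvml (the path below is a placeholder):

    grep -n 'nvml.h' config.log
    ./configure --with-nvml=/path/to/cuda    # root containing include/nvml.h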