[slurm-users] How to fix a node in state=inval?

2023-09-01 Thread Jan Andersen
I am building a cluster exclusively with dynamic nodes, which all boot 
over the network from the same system image (Debian 12); so far there 
is just one physical node, plus a VM that I used for the 
initial tests:


# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
all* up   infinite  1  inval gpu18c04d858b05
all* up   infinite  1  down* node080027aea419
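
The dynamic-node side of the setup follows the pattern in the Slurm 
dynamic-node documentation; roughly this (an illustrative excerpt, not 
my exact files):

# slurm.conf on the controller
MaxNodeCount=64     # room for nodes that register dynamically
TreeWidth=65533     # recommended for dynamic nodes, as far as I can tell
PartitionName=all Nodes=ALL Default=YES MaxTime=INFINITE State=UP

# on each network-booted node, slurmd registers itself at startup:
slurmd -Z --conf "Gres=gpu:geforce:1"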

When I compare what the master node thinks of gpu18c04d858b05 with what 
the node itself reports, they seem to agree:


On gpu18c04d858b05:

root@gpu18c04d858b05:~# slurmd -C
NodeName=gpu18c04d858b05 CPUs=16 Boards=1 SocketsPerBoard=1 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=64240
UpTime=0-18:04:06

And on the master:

# scontrol show node gpu18c04d858b05
NodeName=gpu18c04d858b05 Arch=x86_64 CoresPerSocket=8
   CPUAlloc=0 CPUEfctv=16 CPUTot=16 CPULoad=0.16
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:geforce:1
   NodeAddr=192.168.50.68 NodeHostName=gpu18c04d858b05 Version=23.02.3
   OS=Linux 6.1.0-9-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.27-1 (2023-05-08)
   RealMemory=64240 AllocMem=0 FreeMem=63739 Sockets=1 Boards=1
   State=DOWN+DRAIN+DYNAMIC_NORM+INVALID_REG ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=all
   BootTime=2023-08-31T15:25:55 SlurmdStartTime=2023-08-31T15:26:20
   LastBusyTime=2023-08-31T10:24:01 ResumeAfterTime=None
   CfgTRES=cpu=16,mem=64240M,billing=16
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=hang [root@2023-08-31T16:38:27]

I tried to fix it with:

# scontrol update nodename=gpu18c04d858b05 state=down reason=hang
# scontrol update nodename=gpu18c04d858b05 state=resume

However, that made no difference; what is the next step in 
troubleshooting this issue?
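
The only other ideas I have so far are to grep the controller log for the 
rejected registration, and, since this is a dynamic node, to delete it and 
let it register again from scratch; roughly (untested, and the log path is 
simply where it lives on this system):

grep gpu18c04d858b05 /var/log/slurm/slurmctld.log
scontrol delete nodename=gpu18c04d858b05
ssh gpu18c04d858b05 systemctl restart slurmd
sinfo -N -l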




Re: [slurm-users] configure script can't find nvml.h or libnvidia-ml.so

2023-07-21 Thread Jan Andersen


Also, slurmd.log doesn't list the GPU even when I set SlurmdDebug=debug2 
in the config file (but I see other entries for debug2).


I've set 'AutoDetect=nvml' in gres.conf and 'GresTypes=gpu' in 
slurm.conf; shouldn't it work by now? Or is my build of Slurm still 
missing something?
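
For reference, the relevant config is roughly this (the NodeName line is 
illustrative; the other two settings are the ones mentioned above):

# gres.conf on the GPU node
AutoDetect=nvml

# slurm.conf
GresTypes=gpu
NodeName=zorn Gres=gpu:1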



On 19/07/2023 12:26, Timo Rothenpieler wrote:

On 19/07/2023 11:47, Jan Andersen wrote:
I'm trying to build slurm with nvml support, but configure doesn't 
find it:


root@zorn:~/slurm-23.02.3# ./configure --with-nvml
...
checking for hwloc installation... /usr
checking for nvml.h... no
checking for nvmlInit in -lnvidia-ml... yes
configure: error: unable to locate libnvidia-ml.so and/or nvml.h

But:

root@zorn:~/slurm-23.02.3# find / -xdev -name nvml.h
/usr/include/hwloc/nvml.h


It's not looking for the hwloc header, but for the NVIDIA one.
If you have your CUDA SDK installed in, for example, /opt/cuda, you have to 
point it there: --with-nvml=/opt/cuda



root@zorn:~/slurm-23.02.3# find / -xdev -name libnvidia-ml.so
/usr/lib32/libnvidia-ml.so
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so

I tried to figure out how to tell configure where to find them, but 
the script is a bit eye-watering; how should I do it?









Re: [slurm-users] configure script can't find nvml.h or libnvidia-ml.so

2023-07-19 Thread Jan Andersen
Hmm, OK - but that is the only nvml.h I can find, as shown by the find 
command. I downloaded the official NVIDIA-Linux-x86_64-535.54.03.run and 
ran it successfully; do I need to install something else besides the 
driver? A Google search for 'CUDA SDK' leads directly to NVIDIA's page: 
https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html
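
If the .run file really is only the driver, my understanding (untested) is 
that the CUDA toolkit is what ships nvml.h, and configure can then be 
pointed at the toolkit prefix, e.g. assuming the default install location:

ls /usr/local/cuda/include/nvml.h   # header should come with the toolkit
./configure --with-nvml=/usr/local/cuda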




On 19/07/2023 12:26, Timo Rothenpieler wrote:

On 19/07/2023 11:47, Jan Andersen wrote:
I'm trying to build slurm with nvml support, but configure doesn't 
find it:


root@zorn:~/slurm-23.02.3# ./configure --with-nvml
...
checking for hwloc installation... /usr
checking for nvml.h... no
checking for nvmlInit in -lnvidia-ml... yes
configure: error: unable to locate libnvidia-ml.so and/or nvml.h

But:

root@zorn:~/slurm-23.02.3# find / -xdev -name nvml.h
/usr/include/hwloc/nvml.h


It's not looking for the hwloc header, but for the NVIDIA one.
If you have your CUDA SDK installed in, for example, /opt/cuda, you have to 
point it there: --with-nvml=/opt/cuda



root@zorn:~/slurm-23.02.3# find / -xdev -name libnvidia-ml.so
/usr/lib32/libnvidia-ml.so
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so

I tried to figure out how to tell configure where to find them, but 
the script is a bit eye-watering; how should I do it?









[slurm-users] configure script can't find nvml.h or libnvidia-ml.so

2023-07-19 Thread Jan Andersen

I'm trying to build slurm with nvml support, but configure doesn't find it:

root@zorn:~/slurm-23.02.3# ./configure --with-nvml
...
checking for hwloc installation... /usr
checking for nvml.h... no
checking for nvmlInit in -lnvidia-ml... yes
configure: error: unable to locate libnvidia-ml.so and/or nvml.h

But:

root@zorn:~/slurm-23.02.3# find / -xdev -name nvml.h
/usr/include/hwloc/nvml.h
root@zorn:~/slurm-23.02.3# find / -xdev -name libnvidia-ml.so
/usr/lib32/libnvidia-ml.so
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so

I tried to figure out how to tell configure where to find them, but the 
script is a bit eye-watering; how should I do it?
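
So far the closest I have got is grepping config.log after a failed run, to 
see which paths the header check actually probes (standard autoconf, nothing 
Slurm-specific):

grep -n nvml config.log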