Re: [slurm-users] MIG-Slice: Unavailable GRES

2023-07-19 Thread Groner, Rob
At some point when we were experimenting with MIG, I was entirely frustrated trying to get it to work, until I finally removed the autodetect from gres.conf and listed everything explicitly instead. THEN it worked. I think you can find the list of files that are the device files using

[slurm-users] MIG-Slice: Unavailable GRES

2023-07-19 Thread Vogt, Timon
Dear Slurm Mailing List, I am experiencing a problem which affects our cluster and for which I am completely out of ideas by now, so I would like to ask the community for hints or ideas. We run a partition on our cluster containing multiple nodes with Nvidia A100 GPUs (40GB), which we have

Re: [slurm-users] Unconfigured GPUs being allocated

2023-07-19 Thread Wilson, Steven M
I found that this is actually a known bug in Slurm, so I'll note it here in case anyone comes across this thread in the future: https://bugs.schedmd.com/show_bug.cgi?id=10598 Steve

Re: [slurm-users] GRES and GPUs

2023-07-19 Thread Xaver Stiensmeier
Hi Hermann, count doesn't make a difference, but I noticed that when I reconfigure slurm and do reloads afterwards, the error "gpu count lower than configured" no longer appears - so maybe it is just because a reconfigure is needed after reloading slurmctld - or maybe it doesn't show the error

[slurm-users] MCNP6.2 test

2023-07-19 Thread Ozeryan, Vladimir
Hello everyone, Has anyone here ever run an MCNP6.2 parallel job via the Slurm scheduler? I am looking for a simple test job to test my software compilation. Thank you, Vlad Ozeryan

Re: [slurm-users] configure script can't find nvml.h or libnvidia-ml.so

2023-07-19 Thread Timo Rothenpieler
On 19/07/2023 15:04, Jan Andersen wrote: Hmm, OK - but that is the only nvml.h I can find, as shown by the find command. I downloaded the official NVIDIA-Linux-x86_64-535.54.03.run and ran it successfully; do I need to install something else besides? A Google search for 'CUDA SDK' leads

Re: [slurm-users] Notify users about job submit plugin actions

2023-07-19 Thread Jeffrey T Frey
In case you're developing the plugin in C and not Lua, behind the scenes the Lua mechanism is concatenating all log_user() strings into a single variable (user_msg). When the Lua code completes, the C code sets the *err_msg argument to the job_submit()/job_modify() function to that string,
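The accumulate-then-hand-back pattern described in this message can be modelled in a few lines. This is only a toy sketch in Python, not Slurm's code: `UserMessageBuffer`, `run_job_submit` and `example_script` are hypothetical names, the newline separator between messages is an assumption, and the returned tuple merely stands in for the return code plus the `*err_msg` out-argument.

```python
from typing import Callable, List, Optional, Tuple

SLURM_SUCCESS = 0  # stands in for Slurm's success return code


class UserMessageBuffer:
    """Toy stand-in for the single user_msg variable described above."""

    def __init__(self) -> None:
        self._parts: List[str] = []

    def log_user(self, text: str) -> None:
        # Each call only appends; nothing reaches the user yet.
        self._parts.append(text)

    def take(self) -> Optional[str]:
        # Hand back the concatenated text once, then reset the buffer.
        if not self._parts:
            return None
        message = "\n".join(self._parts)  # assumption: newline separator
        self._parts.clear()
        return message


def run_job_submit(
    script: Callable[[dict, Callable[[str], None]], int],
    job_desc: dict,
) -> Tuple[int, Optional[str]]:
    """Run a filter script, then return (rc, err_msg) like the C wrapper would."""
    buffer = UserMessageBuffer()
    rc = script(job_desc, buffer.log_user)
    return rc, buffer.take()  # second item plays the role of *err_msg


def example_script(job_desc: dict, log_user: Callable[[str], None]) -> int:
    # Hypothetical filter: fill in a default and tell the user about it.
    if job_desc.get("time_limit") is None:
        log_user("No time limit given; applying the default.")
        job_desc["time_limit"] = 60
    log_user("Job accepted by the submit filter.")
    return SLURM_SUCCESS


if __name__ == "__main__":
    rc, err_msg = run_job_submit(example_script, {})
    print(rc)
    print(err_msg)
```

The point of the sketch is that a plugin written directly in C has no log_user() to call; it would have to build the whole message itself and assign it to the err_msg argument before returning.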

Re: [slurm-users] GRES and GPUs

2023-07-19 Thread Groner, Rob
Worth a try, but the documentation says that by default the count is the same as the number of files specified, so it should automatically be 1. If you want to stop the node from going to INVAL, you can always set config_overrides in slurm.conf. That will tell the node what it has, instead of

Re: [slurm-users] GRES and GPUs

2023-07-19 Thread Hermann Schwärzler
Hi Xaver, I think you are missing the "Count=..." part in gres.conf. It should read NodeName=NName Name=gpu File=/dev/tty0 Count=1 in your case. Regards, Hermann On 7/19/23 14:19, Xaver Stiensmeier wrote: Okay, thanks to S. Zhang I was able to figure out why nothing changed. While I did

Re: [slurm-users] configure script can't find nvml.h or libnvidia-ml.so

2023-07-19 Thread Jan Andersen
Hmm, OK - but that is the only nvml.h I can find, as shown by the find command. I downloaded the official NVIDIA-Linux-x86_64-535.54.03.run and ran it successfully; do I need to install something else besides? A Google search for 'CUDA SDK' leads directly to NVIDIA's page:

Re: [slurm-users] slurmctld and slurmdbd on the server, mysql on remote

2023-07-19 Thread AMU
Oops, I found my error: I forgot to remove JobCompHost. I found it after reading this: https://bugs.schedmd.com/show_bug.cgi?id=2322#c5 Sorry for the noise. On 19/07/2023 14:51, Gérard Henry (AMU) wrote: Hello all, is it possible to have this configuration? I installed Slurm on Ubuntu 20 LTS,

[slurm-users] slurmctld and slurmdbd on the server, mysql on remote

2023-07-19 Thread AMU
Hello all, is it possible to have this configuration? I installed Slurm on Ubuntu 20 LTS, but slurmctld refuses to start with these messages: [2023-07-19T14:37:59.563] Job completion MYSQL plugin loaded [2023-07-19T14:37:59.563] debug: /var/log/slurm/jobcomp doesn't look like a database name using

Re: [slurm-users] Notify users about job submit plugin actions

2023-07-19 Thread Angel de Vicente
Hello Lorenzo, Lorenzo Bosio writes: > I'm developing a job submit plugin to check if some conditions are met before a job runs. I'd need a way to notify the user about the plugin actions (i.e. why their job was killed and what to do), but after a lot of research I could only write to

Re: [slurm-users] Notify users about job submit plugin actions

2023-07-19 Thread Ole Holm Nielsen
Hi Lorenzo, On 7/19/23 14:22, Lorenzo Bosio wrote: > I'm developing a job submit plugin to check if some conditions are met before a job runs. I'd need a way to notify the user about the plugin actions (i.e. why their job was killed and what to do), but after a lot of research I could only

[slurm-users] Notify users about job submit plugin actions

2023-07-19 Thread Lorenzo Bosio
Hello everyone, I'm developing a job submit plugin to check if some conditions are met before a job runs. I'd need a way to notify the user about the plugin actions (i.e. why their job was killed and what to do), but after a lot of research I could only write to logs and not to the user's shell.

Re: [slurm-users] GRES and GPUs

2023-07-19 Thread Xaver Stiensmeier
Okay, thanks to S. Zhang I was able to figure out why nothing changed. While I did restart slurmctld at the beginning of my tests, I didn't do so later, because I felt it was unnecessary, but it is right there in the fourth line of the log that this is needed. Somehow I misread it and

Re: [slurm-users] configure script can't find nvml.h or libnvidia-ml.so

2023-07-19 Thread Timo Rothenpieler
On 19/07/2023 11:47, Jan Andersen wrote: I'm trying to build slurm with nvml support, but configure doesn't find it: root@zorn:~/slurm-23.02.3# ./configure --with-nvml ... checking for hwloc installation... /usr checking for nvml.h... no checking for nvmlInit in -lnvidia-ml... yes configure:

[slurm-users] configure script can't find nvml.h or libnvidia-ml.so

2023-07-19 Thread Jan Andersen
I'm trying to build slurm with nvml support, but configure doesn't find it: root@zorn:~/slurm-23.02.3# ./configure --with-nvml ... checking for hwloc installation... /usr checking for nvml.h... no checking for nvmlInit in -lnvidia-ml... yes configure: error: unable to locate libnvidia-ml.so

Re: [slurm-users] GRES and GPUs

2023-07-19 Thread Xaver Stiensmeier
Alright, I tried a few more things, but I still wasn't able to get past: srun: error: Unable to allocate resources: Invalid generic resource (gres) specification. I should mention that the node I am trying to test GPU with doesn't really have a GPU, but Rob was kind enough to find out that you do