Re: [slurm-users] nvml autodetect is ignoring gpus

2021-11-30 Thread Diego Zuccato
On 30/11/2021 16:12, Benjamin Nacar wrote: "However, the version of Slurm in the standard Debian repositories was apparently not compiled on a system with the necessary Nvidia library installed," That's not good news :( I have a GPU node arriving by the end of the year. Does it only

[slurm-users] slurmstepd: error: Too many levels of symbolic links

2021-11-30 Thread Adrian Sevcenco
Hi! Does anyone know what could be the cause of such an error? I have a shared home and Slurm 20.11.8, and I am trying to run a simple script from the submit directory, which is in the NFS-shared home. I also have job_container.conf defined, but I have no idea whether this is a problem. Thank you! Adrian

[slurm-users] nvml autodetect is ignoring gpus

2021-11-30 Thread Benjamin Nacar
Hi, We're trying to use Slurm's built-in Nvidia GPU detection mechanism to avoid having to specify GPUs explicitly in slurm.conf and gres.conf. We're running Debian 11, and the version of Slurm available for Debian 11 is 20.11. However, the version of Slurm in the standard Debian repositories

Re: [slurm-users] WTERMSIG 15

2021-11-30 Thread LEROY Christine 208562
Hi, Thanks for your feedback. It seems we are in the first case, but looking deeper: for the SL7 node we didn’t encounter the problem, thanks to this service configuration (*). So the solution seems to be to configure KillMode=process as mentioned there (**): we will still have jobs listed when doing a

Re: [slurm-users] Error " slurm_receive_msg_and_forward: Zero Bytes were transmitted or received"

2021-11-30 Thread Nicolas Greneche
Hi, I had the same issue with ntpd. My NTP service on the clients did not synchronize because the drift from the NTP server was too large. Maybe you can synchronize with ntpdate before starting the NTP service on your clients. Regards, On 30 Nov 2021 12:23, Gestió Servidors wrote: Hello, In the last

[slurm-users] Error " slurm_receive_msg_and_forward: Zero Bytes were transmitted or received"

2021-11-30 Thread Gestió Servidors
Hello, In the last few days, my nodes have been showing the error "slurm_receive_msg_and_forward: Zero Bytes were transmitted or received". After reviewing all the configuration, I have noticed that the problem is the time difference between the nodes and the server. If nodes are badly configured (time in the future or in the
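
The time-difference diagnosis in this thread can be illustrated with a minimal sketch, which is not Slurm's own logic. It compares a node clock reading against the server's and flags the node when the gap exceeds a tolerance. The names `clock_skew`, `is_in_sync` and the `MAX_SKEW` value are hypothetical, chosen for illustration only; the tolerance that actually applies depends on the site's configuration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical tolerance, for illustration only; the real limit
# depends on how authentication is configured at the site.
MAX_SKEW = timedelta(seconds=300)


def clock_skew(node_time, server_time):
    """Return the absolute difference between two clock readings."""
    return abs(node_time - server_time)


def is_in_sync(node_time, server_time, max_skew=MAX_SKEW):
    """True when the node clock is within max_skew of the server clock."""
    return clock_skew(node_time, server_time) <= max_skew


# Example readings: one node ten minutes ahead, one two seconds behind.
server = datetime(2021, 11, 30, 12, 0, 0, tzinfo=timezone.utc)
node_ahead = server + timedelta(minutes=10)
node_ok = server - timedelta(seconds=2)

print(is_in_sync(node_ahead, server))
print(is_in_sync(node_ok, server))
```

Because the check uses the absolute difference, a node whose clock is in the past is treated the same way as one whose clock is in the future.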