[slurm-users] nodes going to down* and getting stuck in that state

2021-05-19 Thread Herc Silverstein
Hi, We have a cluster (in Google GCP) with a few partitions set up to auto-scale and one partition that is not autoscaled. The desired state is for all of the nodes in this non-autoscaled partition (SuspendExcParts=gpu-t4-4x-ondemand) to continue running uninterrupted. However, we ...
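For context, a minimal power-saving sketch of the setup described above; only the partition name and the SuspendExcParts value come from the thread, everything else (node range, timings, script paths) is illustrative:

    # slurm.conf (excerpt) -- illustrative except where noted
    SuspendTime=300                                 # power down nodes idle this many seconds
    SuspendProgram=/usr/local/sbin/gcp_suspend.sh   # hypothetical script that stops GCP instances
    ResumeProgram=/usr/local/sbin/gcp_resume.sh     # hypothetical script that starts GCP instances
    SuspendExcParts=gpu-t4-4x-ondemand              # from the thread: never power down this partition

    PartitionName=gpu-t4-4x-ondemand Nodes=gpu-t4-4x-ondemand-[1-64] State=UP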

Re: [slurm-users] nodes going to down* and getting stuck in that state

2021-05-20 Thread bbenedetto
We had a situation recently where a desktop was turned off for a week. When we brought it back online (in a different part of the network, with a different IP), everything came up fine (slurmd and munge), but it kept going into DOWN* for no apparent reason (nothing obvious from either the daemons or the logs). ...

Re: [slurm-users] nodes going to down* and getting stuck in that state

2021-05-20 Thread Brian Andrus
Does it tell you the reason for it being down? sinfo -R. I have seen cases where a node comes up, but the amount of memory slurmd sees is a little less than what was configured in slurm.conf. You should always set aside some of the memory when defining it in slurm.conf so you have memory left for the operating system ...
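A quick way to check both points (the recorded reason and a possible memory mismatch); the node name is the one from the thread, the memory figures are illustrative:

    # Why did Slurm mark nodes down/drained?
    sinfo -R

    # On the node: print the hardware slurmd actually detects, in slurm.conf syntax
    slurmd -C

    # Then, in slurm.conf, keep RealMemory a bit below what slurmd reports,
    # or reserve memory for the OS explicitly, e.g.:
    #   NodeName=gpu-t4-4x-ondemand-44 ... RealMemory=190000 MemSpecLimit=8192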

Re: [slurm-users] nodes going to down* and getting stuck in that state

2021-05-20 Thread Tim Carlson
The SLURM controller AND all the compute nodes need to know about every node in the cluster. If you add a node or a node changes IP address, you need to let all the nodes know about it, which for me usually means restarting slurmd on the compute nodes. I just say this because I get caught by this ...
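One way to do what Tim describes after adding a node or changing its IP (the host range is illustrative, and whether a full daemon restart is needed rather than a reconfigure depends on the Slurm version and the kind of change):

    # Distribute the updated slurm.conf, then restart the daemons
    sudo systemctl restart slurmctld                                  # on the controller
    pdsh -w gpu-t4-4x-ondemand-[1-64] 'sudo systemctl restart slurmd' # on the compute nodes

    # For changes that keep the node list intact, a reconfigure is often enough
    scontrol reconfigure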

Re: [slurm-users] nodes going to down* and getting stuck in that state

2021-05-20 Thread Christopher Samuel
On 5/19/21 9:15 pm, Herc Silverstein wrote: Does anyone have an idea of what might be going on? To add to the other suggestions, I would say that checking the slurmctld and slurmd logs to see what they report as wrong is a good place to start. Best of luck, Chris -- Chris Samuel ...
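To locate and follow those logs (paths vary by site; these are just common ways to find them):

    # Where does this cluster write its logs?
    scontrol show config | grep -i -E 'slurmctldlogfile|slurmdlogfile'

    # Controller side
    tail -f /var/log/slurm/slurmctld.log    # or wherever SlurmctldLogFile points

    # Node side (file or journal, depending on the install)
    tail -f /var/log/slurm/slurmd.log
    journalctl -u slurmd -f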

Re: [slurm-users] nodes going to down* and getting stuck in that state

2021-06-04 Thread Herc Silverstein
Hi, The slurmctld.log shows (for this node): ... [2021-05-25T00:12:27.481] sched: Allocate JobId=3402729 NodeList=gpu-t4-4x-ondemand-44 #CPUs=1 Partition=gpu-t4-4x-ondemand [2021-05-25T00:12:27.482] sched: Allocate JobId=3402730 NodeList=gpu-t4-4x-ondemand-44 #CPUs=1 Partition=gpu-t4-4x-ondemand ...
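Once the underlying cause is fixed, a node stuck in DOWN* can usually be inspected and returned to service like this (node name taken from the log excerpt above):

    # See the recorded reason and current state
    scontrol show node gpu-t4-4x-ondemand-44

    # Clear a DOWN/DRAIN state once the node is healthy again
    scontrol update NodeName=gpu-t4-4x-ondemand-44 State=RESUME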

Re: [slurm-users] nodes going to down* and getting stuck in that state

2021-06-04 Thread Brian Andrus
Sounds like a firewall issue. When you log on to the 'down' node, can you run 'sinfo' or 'squeue' there? Also, verify munge is configured/running properly on the node. Brian Andrus On 6/4/2021 9:31 AM, Herc Silverstein wrote: Hi, The slurmctld.log shows (for this node): ...
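A few checks for the firewall and munge suggestions (the controller hostname is a placeholder; the ports assume the default SlurmctldPort/SlurmdPort values):

    # From the node: can it reach slurmctld on the controller (default port 6817)?
    nc -zv <controller-host> 6817

    # From the controller: can it reach slurmd on the node (default port 6818)?
    nc -zv gpu-t4-4x-ondemand-44 6818

    # Munge: daemon running and credentials valid across hosts?
    systemctl status munge
    munge -n | unmunge                          # local round trip
    munge -n | ssh <controller-host> unmunge    # cross-host check of key and clocks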

Re: [slurm-users] nodes going to down* and getting stuck in that state

2021-06-04 Thread Brian Andrus
Oh, also ensure DNS is working properly on the node. It could be that it isn't able to map the controller's name to its IP. Brian Andrus On 6/4/2021 9:31 AM, Herc Silverstein wrote: Hi, The slurmctld.log shows (for this node): ... [2021-05-25T00:12:27.481] sched: Allocate JobId=3402729 ...
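And for the DNS point, from the node itself (controller hostname is a placeholder):

    # Does the node resolve the controller's name to the expected IP?
    getent hosts <controller-host>

    # Does slurmctld answer from this node's point of view?
    scontrol ping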