[slurm-users] only 1 job running

2021-01-27 Thread Chandler
Hi list, we have a new cluster set up with Bright cluster manager. Looking into a support contract there, but trying to get community support in the meantime. I'm sure things were working when the cluster was delivered, but I provisioned an additional node and now the scheduler isn't quite wo

Re: [slurm-users] only 1 job running

2021-01-27 Thread Chandler
Made a little bit of progress by running sinfo:

PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
defq*     up    infinite      3 drain n[011-013]
defq*     up    infinite      1 alloc n010

Not sure why n[011-013] are in drain state, that needs to be fixed. After some searching, I ran: s

Re: [slurm-users] only 1 job running

2021-01-28 Thread Andy Riebs
Hi Chandler, If the only changes to your system have been the slurm.conf configuration and the addition of a new node, the easiest way to track this down is probably to show us the diffs between the previous and current versions of slurm.conf, and a note about what's different about the new n

Re: [slurm-users] only 1 job running

2021-01-28 Thread Brian Andrus
Heh. Your nodes are drained. Do:

scontrol update state=resume nodename=n[011-013]

If they go back into a drained state, you need to look into why. That will be in the slurmctld log. You can also see it with 'sinfo -R'.

Brian Andrus

On 1/27/2021 10:18 PM, Chandler wrote: Made a little bit of
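The advice above can be sketched as a small shell helper. This is only an illustration: the function name drained_nodes is made up here, and it assumes the default six-column sinfo layout (PARTITION AVAIL TIMELIMIT NODES STATE NODELIST) shown earlier in the thread. It reads sinfo output on stdin, so it can be tried on saved output without touching the cluster.

```shell
#!/bin/sh
# Hypothetical helper: print the NODELIST field of every row whose STATE
# field starts with "drain" (drain, drained, draining, drain*, ...).
# Assumes default-format `sinfo` output on stdin (STATE is column 5,
# NODELIST is column 6).
drained_nodes() {
    awk '$5 ~ /^drain/ { print $6 }'
}

# Illustrative use on a live cluster (requires Slurm admin rights):
#   sinfo -R        # check the recorded drain reason first
#   sinfo | drained_nodes | while read -r nodes; do
#       scontrol update nodename="$nodes" state=resume
#   done
```

Reading the node list with `while read -r` rather than an unquoted `for` loop keeps the shell from treating a bracketed range such as n[011-013] as a filename pattern.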

Re: [slurm-users] only 1 job running

2021-01-28 Thread Christopher Samuel
On 1/27/21 9:28 pm, Chandler wrote: Hi list, we have a new cluster set up with Bright cluster manager. Looking into a support contract there, but trying to get community support in the meantime. I'm sure things were working when the cluster was delivered, but I provisioned an additional node

Re: [slurm-users] only 1 job running

2021-01-28 Thread Chandler
Christopher Samuel wrote on 1/28/21 12:50: Did you restart the slurm daemons when you added the new node? Some internal data structures (bitmaps) are built based on the number of nodes and they need to be rebuilt with a restart in this situation. https://slurm.schedmd.com/faq.html#add_nodes
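A dry-run sketch of the restart described above, for illustration only. It prints the commands instead of running them. Several things here are assumptions, not taken from the thread: the function name restart_plan, the systemd unit names slurmctld and slurmd, the use of ssh to reach compute nodes, and the order (stop the controller, restart slurmd on every node, start the controller). Check the linked FAQ entry for your Slurm version, and note that a cluster manager such as Bright may control these services itself.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print, in order, the commands one might run
# after adding a node to slurm.conf. Nothing is executed.
# Arguments: the compute node hostnames.
restart_plan() {
    echo "systemctl stop slurmctld"
    for node in "$@"; do
        # Assumes passwordless ssh and a systemd unit named slurmd.
        echo "ssh $node systemctl restart slurmd"
    done
    echo "systemctl start slurmctld"
}

# Example: review the plan before running anything by hand.
restart_plan n010 n011 n012 n013
```

Printing the plan first makes it easy to compare against site procedures before any daemon is actually restarted.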