Re: [slurm-users] Can't get node out of drain state
On 23/1/20 7:09 pm, Dean Schulze wrote:

> Pretty strange that having a Gres= property on a node that doesn't have a GPU would get it stuck in the drain state.

Slurm verifies that nodes have the capabilities you say they have. That way, if a node boots with less RAM than it should have, or with a socket hidden, or if a GPU fails and the node reboots, you'll know about it. Slurm won't blindly send jobs to the node only for them to fail because it no longer meets their requirements.

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
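To make Chris's point concrete, here is a toy Python model of that registration check. It is a minimal sketch and not Slurm's actual code. It compares what slurm.conf says a node has with what the node reports, and it flags the node for draining if anything comes up short. The class and function names, the resource values, and the exact reason strings are assumptions for illustration. On a real cluster, 'sinfo -R' shows the reason Slurm actually recorded.

```python
from dataclasses import dataclass, field


@dataclass
class NodeResources:
    """Hypothetical record of a node's resources (memory in MB, GRES as name -> count)."""
    cpus: int
    real_memory_mb: int
    gres: dict[str, int] = field(default_factory=dict)


def drain_reasons(configured: NodeResources, reported: NodeResources) -> list[str]:
    """Return reasons to drain a node that reports fewer resources than configured.

    Conceptual model only; the reason strings are illustrative, not Slurm's source.
    """
    reasons = []
    if reported.cpus < configured.cpus:
        reasons.append(f"Low CPUs ({reported.cpus} < {configured.cpus})")
    if reported.real_memory_mb < configured.real_memory_mb:
        reasons.append(
            f"Low RealMemory ({reported.real_memory_mb} < {configured.real_memory_mb})"
        )
    # Every GRES declared in the config must be present on the node in at least that count.
    for name, want in configured.gres.items():
        have = reported.gres.get(name, 0)
        if have < want:
            reasons.append(
                f"gres/{name} count reported lower than configured ({have} < {want})"
            )
    return reasons


if __name__ == "__main__":
    # Dean's case: slurm.conf declares one GPU (Gres=gpu:gp100:1), but the node has none.
    configured = NodeResources(cpus=8, real_memory_mb=16000, gres={"gpu": 1})
    reported = NodeResources(cpus=8, real_memory_mb=16000)
    print(drain_reasons(configured, reported))
    # → ['gres/gpu count reported lower than configured (0 < 1)']
```

A non-empty list means the node is held back from scheduling rather than handed jobs it cannot run, which matches what Dean ran into.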
Re: [slurm-users] Can't get node out of drain state
The problem turned out to be that I had Gres=gpu:gp100:1 on the NodeName line for that node, and it didn't have a GPU or a gres.conf. Once I moved that to the correct NodeName line in slurm.conf, the node came out of the drain state and became usable again.

Pretty strange that having a Gres= property on a node that doesn't have a GPU would get it stuck in the drain state.

On Thu, Jan 23, 2020 at 2:34 PM Alex Chekholko wrote:
> Hey Dean,
>
> Does 'scontrol show node … at 'sinfo -R'.
>
> Make sure the relevant network ports are open:
>
> https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-firewall-for-slurm-daemons
>
> Also check that slurmd daemons on the compute nodes can talk to each other
> (not just to the master), e.g. the bottom of
> https://slurm.schedmd.com/big_sys.html
>
> Regards,
> Alex
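The root cause was a Gres= entry on the wrong NodeName line, so a quick audit of slurm.conf can catch this kind of mistake. Below is a minimal Python sketch. It assumes one NodeName definition per line with simple key=value tokens. It does not expand hostlists like node[01-04] and gives NodeName=DEFAULT no special treatment. The function name and the sample node names are hypothetical.

```python
from typing import Optional


def gres_by_node(slurm_conf_text: str) -> dict[str, Optional[str]]:
    """Map each NodeName value to its Gres= value, or None if it has none."""
    nodes: dict[str, Optional[str]] = {}
    for raw in slurm_conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        pairs = {}
        for token in line.split():
            key, sep, value = token.partition("=")
            if sep:
                pairs[key.lower()] = value  # slurm.conf parameter names are case-insensitive
        if "nodename" in pairs:  # skip PartitionName= and other non-node lines
            nodes[pairs["nodename"]] = pairs.get("gres")
    return nodes


SAMPLE_CONF = """
# Hypothetical nodes: only gpunode01 actually has a GPU and a gres.conf
NodeName=gpunode01 CPUs=16 Gres=gpu:gp100:1 State=UNKNOWN
NodeName=cpunode01 CPUs=8 State=UNKNOWN
"""

if __name__ == "__main__":
    for node, gres in gres_by_node(SAMPLE_CONF).items():
        print(f"{node}: {'Gres=' + gres if gres else 'no Gres'}")
```

Every node listed with a Gres value should physically have that device and a matching gres.conf on the node.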
Re: [slurm-users] Can't get node out of drain state
Hey Dean,

Does 'scontrol show node … at 'sinfo -R'.

Make sure the relevant network ports are open:

https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-firewall-for-slurm-daemons

Also check that slurmd daemons on the compute nodes can talk to each other (not just to the master), e.g. the bottom of https://slurm.schedmd.com/big_sys.html

Regards,
Alex
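For the firewall check, here is a minimal Python sketch that tests whether TCP connections to the Slurm daemons succeed. It assumes the default ports: 6817 for slurmctld and 6818 for slurmd. If your slurm.conf sets SlurmctldPort or SlurmdPort, use those values instead. Run it from each compute node as well as the controller, since the slurmd daemons need to reach each other too. The script name and function names are hypothetical.

```python
import socket
import sys

# Slurm's default ports; replace with SlurmctldPort/SlurmdPort if slurm.conf sets them.
SLURMCTLD_PORT = 6817
SLURMD_PORT = 6818


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, or unresolvable host
        return False


def check(controller: str, nodes: list[str]) -> None:
    """Report reachability of slurmctld on the controller and slurmd on each node."""
    targets = [(controller, SLURMCTLD_PORT)] + [(n, SLURMD_PORT) for n in nodes]
    for host, port in targets:
        status = "open" if port_open(host, port) else "NOT reachable"
        print(f"{host}:{port} {status}")


if __name__ == "__main__":
    # Hypothetical usage: python check_slurm_ports.py CONTROLLER NODE [NODE ...]
    if len(sys.argv) >= 2:
        check(sys.argv[1], sys.argv[2:])
    else:
        print("usage: check_slurm_ports.py CONTROLLER [NODE ...]")
```

A plain TCP connect only shows that something is listening and that no firewall drops the traffic. It does not prove that the daemons authenticate each other correctly.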
[slurm-users] Can't get node out of drain state
I've tried the normal things with scontrol (https://blog.redbranch.net/2015/12/26/resetting-drained-slurm-node/), but I have a node that will not come out of the drain state.

I've also done a hard reboot and tried again. Are there any other remedies?

Thanks.
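Before resuming a node, it helps to see why Slurm drained it. As this thread shows, if the underlying cause (here, a Gres mismatch) is still present, the node may simply be drained again. Below is a minimal Python sketch that assumes Python 3.9+ and scontrol on PATH. It reads State and Reason from 'scontrol show node' and prints the standard 'scontrol update NodeName=<node> State=RESUME' command instead of running it. The parsing assumes the usual key=value layout with Reason on its own line, and the helper names are hypothetical.

```python
import re
import subprocess
import sys
from typing import Optional


def show_node(node: str) -> str:
    """Return the text of `scontrol show node NODE` (requires scontrol on PATH)."""
    result = subprocess.run(
        ["scontrol", "show", "node", node],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def node_state_and_reason(text: str) -> tuple[Optional[str], Optional[str]]:
    """Extract the State= token (e.g. IDLE+DRAIN) and the Reason= line from scontrol output."""
    state = re.search(r"\bState=(\S+)", text)
    reason = re.search(r"^\s*Reason=(.*)$", text, re.MULTILINE)
    return (
        state.group(1) if state else None,
        reason.group(1).strip() if reason else None,
    )


def resume_argv(node: str) -> list[str]:
    """The scontrol command that clears DRAIN on a node."""
    return ["scontrol", "update", f"NodeName={node}", "State=RESUME"]


if __name__ == "__main__":
    if len(sys.argv) == 2:
        node = sys.argv[1]
        state, reason = node_state_and_reason(show_node(node))
        print(f"{node}: State={state} Reason={reason}")
        if state and "DRAIN" in state:
            # Fix whatever Reason names first; otherwise the node may be drained again.
            print("Once the cause is fixed, run:", " ".join(resume_argv(node)))
    else:
        print("usage: drain_check.py NODE")
```

The script prints the resume command instead of running it. This keeps the "read the reason first" step explicit.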