Re: [slurm-users] Can't get node out of drain state

2020-01-23 Thread Chris Samuel

On 23/1/20 7:09 pm, Dean Schulze wrote:

> Pretty strange that having a Gres= property on a node that doesn't have
> a gpu would get it stuck in the drain state.


Slurm verifies that nodes have the capabilities you say they have. That
way, if a node boots with less RAM than it should have, or with a socket
hidden, or if a GPU fails and the node reboots, you'll know about it and
won't blindly send jobs to it, only for them to fail because the node no
longer meets their requirements.
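
A rough way to see this check from the admin side is to compare what slurmd
detects on the node with what slurm.conf claims. The sketch below is only an
illustration: node_field is a made-up helper, node01 is a placeholder node
name, and /etc/slurm/slurm.conf is an assumed path that differs between
installs. It relies on 'slurmd -C', which prints the hardware configuration
slurmd detects on the local machine.

```shell
# node_field KEY: read "Key=Value Key=Value ..." lines on stdin and print
# the value stored under KEY. Hypothetical helper, not part of Slurm.
node_field() {
    awk -v key="$1" '{
        for (i = 1; i <= NF; i++) {
            split($i, kv, "=")
            if (kv[1] == key) print kv[2]
        }
    }'
}

# Usage (needs Slurm installed, so it is left commented out here):
#   slurmd -C | node_field RealMemory      # what the node actually has
#   grep '^NodeName=node01' /etc/slurm/slurm.conf | node_field RealMemory
```

If the configured value is larger than the detected one, that mismatch is the
kind of thing that leaves a node drained.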


All the best,
Chris
--
 Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA



Re: [slurm-users] Can't get node out of drain state

2020-01-23 Thread Dean Schulze
The problem turned out to be that I had Gres=gpu:gp100:1 on the NodeName
line for that node and it didn't have a gpu or a gres.conf.  Once I moved
that to the correct NodeName line in slurm.conf that node came out of the
drain state and became usable again.

Pretty strange that having a Gres= property on a node that doesn't have a
gpu would get it stuck in the drain state.
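
Since the culprit here was a Gres= entry on the wrong NodeName line, one quick
sanity check is to list which nodes slurm.conf claims have generic resources
and compare that against the machines that really have GPUs (and a gres.conf).
A minimal sketch, assuming the config lives at /etc/slurm/slurm.conf (this
varies by site). nodes_with_gres is a hypothetical helper, not a Slurm
command; it matches the NodeName=/Gres= spelling exactly and ignores
backslash line continuations.

```shell
# nodes_with_gres: read slurm.conf on stdin and print "<NodeName> <Gres>"
# for every NodeName line that carries a Gres= setting.
nodes_with_gres() {
    awk '/^NodeName=/ {
        name = ""; gres = ""
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^NodeName=/) name = substr($i, 10)  # text after "NodeName="
            if ($i ~ /^Gres=/)     gres = substr($i, 6)   # text after "Gres="
        }
        if (gres != "") print name, gres
    }'
}

# Usage (path is an assumption):
#   nodes_with_gres < /etc/slurm/slurm.conf
```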



On Thu, Jan 23, 2020 at 2:34 PM Alex Chekholko wrote:

> Hey Dean,
>
> Does 'scontrol show node <nodename>' give a reason? Also look at 'sinfo -R'.
>
> Make sure the relevant network ports are open:
>
> https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-firewall-for-slurm-daemons
>
> Also check that slurmd daemons on the compute nodes can talk to each other
> (not just to the master). e.g. bottom of
> https://slurm.schedmd.com/big_sys.html
>
> Regards,
> Alex
>
> On Thu, Jan 23, 2020 at 1:05 PM Dean Schulze wrote:
>
>> I've tried the normal things with scontrol (
>> https://blog.redbranch.net/2015/12/26/resetting-drained-slurm-node/),
>> but I have a node that will not come out of the drain state.
>>
>> I've also done a hard reboot and tried again.  Are there any other
>> remedies?
>>
>> Thanks.
>>
>


Re: [slurm-users] Can't get node out of drain state

2020-01-23 Thread Alex Chekholko
Hey Dean,

Does 'scontrol show node <nodename>' give a reason? Also look at 'sinfo -R'.

Make sure the relevant network ports are open:

https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-firewall-for-slurm-daemons

Also check that slurmd daemons on the compute nodes can talk to each other
(not just to the master). e.g. bottom of
https://slurm.schedmd.com/big_sys.html
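
To pull out just the fields that matter from the 'scontrol show node' output,
something like the following may help. It is a sketch only:
node_state_reason is a made-up helper and node01 is a placeholder node name.
It assumes the output contains State= and Reason= fields, which is how a
drained node is normally reported.

```shell
# node_state_reason: read `scontrol show node` output on stdin and print
# only the State= field and the full Reason= text.
node_state_reason() {
    awk '{
        # State= is a single whitespace-delimited field
        for (i = 1; i <= NF; i++) if ($i ~ /^State=/) print $i
    }
    # Reason= may contain spaces, so keep the rest of the line
    match($0, /Reason=.*/) { print substr($0, RSTART, RLENGTH) }'
}

# Usage (needs the Slurm client tools, so it is left commented out here):
#   scontrol show node node01 | node_state_reason
#   sinfo -R                                        # reasons for down/drained nodes
#   scontrol update NodeName=node01 State=RESUME    # once the cause is fixed
```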

Regards,
Alex

On Thu, Jan 23, 2020 at 1:05 PM Dean Schulze wrote:

> I've tried the normal things with scontrol (
> https://blog.redbranch.net/2015/12/26/resetting-drained-slurm-node/), but
> I have a node that will not come out of the drain state.
>
> I've also done a hard reboot and tried again.  Are there any other
> remedies?
>
> Thanks.
>


[slurm-users] Can't get node out of drain state

2020-01-23 Thread Dean Schulze
I've tried the normal things with scontrol (
https://blog.redbranch.net/2015/12/26/resetting-drained-slurm-node/), but I
have a node that will not come out of the drain state.

I've also done a hard reboot and tried again.  Are there any other remedies?

Thanks.