[slurm-users] Nodes stuck in drain state

2023-05-25 Thread Roger Mason
Hello,

slurm 20.02.7 on FreeBSD.

I have a couple of nodes stuck in the drain state.  I have tried

scontrol update nodename=node012 state=down reason="stuck in drain state"
scontrol update nodename=node012 state=resume

without success.

I then tried

/usr/local/sbin/slurmctld -c
scontrol update nodename=node012 state=idle

also without success.

Is there some other method I can use to get these nodes back up?

Thanks,
Roger



Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Ole Holm Nielsen

On 5/25/23 13:59, Roger Mason wrote:

slurm 20.02.7 on FreeBSD.


Uh, that's old!


I have a couple of nodes stuck in the drain state.  I have tried

scontrol update nodename=node012 state=down reason="stuck in drain state"
scontrol update nodename=node012 state=resume

without success.

I then tried

/usr/local/sbin/slurmctld -c
scontrol update nodename=node012 state=idle

also without success.

Is there some other method I can use to get these nodes back up?


What's the output of "scontrol show node node012"?

/Ole



Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Doug Meyer
Could also review the node log in /var/log/slurm/.  Often sinfo -lR will
tell you the cause, for example memory not matching the config.
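
For example, something like the following (the slurmd.log file name is an
assumption; the actual path depends on the SlurmdLogFile setting in your
slurm.conf):

sinfo -lR
ssh node012 tail -n 50 /var/log/slurm/slurmd.log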

Doug

On Thu, May 25, 2023 at 5:32 AM Ole Holm Nielsen wrote:

> On 5/25/23 13:59, Roger Mason wrote:
> > slurm 20.02.7 on FreeBSD.
>
> Uh, that's old!
>
> > I have a couple of nodes stuck in the drain state.  I have tried
> >
> > scontrol update nodename=node012 state=down reason="stuck in drain state"
> > scontrol update nodename=node012 state=resume
> >
> > without success.
> >
> > I then tried
> >
> > /usr/local/sbin/slurmctld -c
> > scontrol update nodename=node012 state=idle
> >
> > also without success.
> >
> > Is there some other method I can use to get these nodes back up?
>
> What's the output of "scontrol show node node012"?
>
> /Ole
>
>


Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Roger Mason


Ole Holm Nielsen  writes:

> On 5/25/23 13:59, Roger Mason wrote:
>> slurm 20.02.7 on FreeBSD.
>
> Uh, that's old!

Yes.  It is what is available in ports.

> What's the output of "scontrol show node node012"?

NodeName=node012 CoresPerSocket=2 
   CPUAlloc=0 CPUTot=4 CPULoad=N/A
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=node012 NodeHostName=node012 
   RealMemory=10193 AllocMem=0 FreeMem=N/A Sockets=2 Boards=1
   State=UNKNOWN+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A 
MCS_label=N/A
   Partitions=macpro 
   BootTime=None SlurmdStartTime=None
   CfgTRES=cpu=4,mem=10193M,billing=4
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Low RealMemory [slurm@2023-05-25T09:26:59]

But the 'Low RealMemory' is incorrect.  The entry in slurm.conf for
node012 is:

NodeName=node012 CPUs=4 Boards=1 SocketsPerBoard=2 CoresPerSocket=2
ThreadsPerCore=1 RealMemory=10193  State=UNKNOWN

Thanks for the help.
Roger



Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Roger Mason
Hello,

Doug Meyer  writes:

> Could also review the node log in /var/log/slurm/.  Often sinfo -lR will tell
> you the cause, for example memory not matching the config.
>
REASON   USER TIMESTAMP   STATE  NODELIST 
Low RealMemory   slurm(468)   2023-05-25T09:26:59 drain* node012 
Not responding   slurm(468)   2023-05-25T09:30:31 down*
node[001-003,008]

But, as I said in my response to Ole, the memory in slurm.conf and in
the 'show node' output match.

Many thanks for the help.

Roger



Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Davide DelVento
Can you ssh into the node and check the actual availability of memory?
Maybe there is a zombie process (or a healthy one with a memory leak bug)
that's hogging all the memory?
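
For example, something along these lines might do (FreeBSD commands
assumed, since that is what your nodes run; hw.physmem and hw.realmem are
the stock sysctls for installed memory):

ssh node012 'top -b | head -n 8'
ssh node012 sysctl hw.physmem hw.realmem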

On Thu, May 25, 2023 at 7:31 AM Roger Mason  wrote:

> Hello,
>
> Doug Meyer  writes:
>
> > Could also review the node log in /var/log/slurm/.  Often sinfo -lR will
> tell you the cause, for example memory not matching the config.
> >
> REASON   USER TIMESTAMP   STATE  NODELIST
> Low RealMemory   slurm(468)   2023-05-25T09:26:59 drain* node012
> Not responding   slurm(468)   2023-05-25T09:30:31 down*
> node[001-003,008]
>
> But, as I said in my response to Ole, the memory in slurm.conf and in
> the 'show node' output match.
>
> Many thanks for the help.
>
> Roger
>
>


Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Ole Holm Nielsen

On 5/25/23 15:23, Roger Mason wrote:

NodeName=node012 CoresPerSocket=2
CPUAlloc=0 CPUTot=4 CPULoad=N/A
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=node012 NodeHostName=node012
RealMemory=10193 AllocMem=0 FreeMem=N/A Sockets=2 Boards=1
State=UNKNOWN+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A 
MCS_label=N/A
Partitions=macpro
BootTime=None SlurmdStartTime=None
CfgTRES=cpu=4,mem=10193M,billing=4
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Reason=Low RealMemory [slurm@2023-05-25T09:26:59]

But the 'Low RealMemory' is incorrect.  The entry in slurm.conf for
node012 is:

NodeName=node012 CPUs=4 Boards=1 SocketsPerBoard=2 CoresPerSocket=2
ThreadsPerCore=1 RealMemory=10193  State=UNKNOWN


Thanks for the info.  Some questions arise:

1. Is slurmd running on the node?

2. What's the output of "slurmd -C" on the node?  (A quick way to check 
points 1 and 2 remotely is sketched just after this list.)

3. Define State=UP in slurm.conf instead of UNKNOWN

4. Why have you configured TmpDisk=0?  It should be the size of the /tmp 
filesystem.
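
A quick way to check points 1 and 2 from the head node could be something
like this (the slurmd path is an assumption, based on the /usr/local/sbin
path you used for slurmctld):

ssh node012 pgrep -lf slurmd
ssh node012 /usr/local/sbin/slurmd -C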


Since you run Slurm 20.02, there are some suggestions in my Wiki page 
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#compute-node-configuration 
where this might be useful:



Note for Slurm 20.02: The Boards=1 SocketsPerBoard=2 configuration gives error 
messages, see bug_9241 and bug_9233. Use Sockets= instead:
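
As a sketch only, the node012 line with that one change applied would look
something like this (values copied from the line you posted, so verify them
against the node):

NodeName=node012 CPUs=4 Sockets=2 CoresPerSocket=2 ThreadsPerCore=1
RealMemory=10193 State=UNKNOWN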


I hope changing these slurm.conf parameters will help.

Best regards,
Ole






Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Roger Mason
Hello,

Davide DelVento  writes:

> Can you ssh into the node and check the actual availability of memory?
> Maybe there is a zombie process (or a healthy one with a memory leak
> bug) that's hogging all the memory?

This is what top shows:

last pid: 45688;  load averages:  0.00,  0.00,  0.00
   up 0+03:56:52  11:58:13
26 processes:  1 running, 25 sleeping
CPU:  0.0% user,  0.0% nice,  0.1% system,  0.0% interrupt, 99.9% idle
Mem: 9452K Active, 69M Inact, 290M Wired, 287K Buf, 5524M Free
ARC: 125M Total, 37M MFU, 84M MRU, 168K Anon, 825K Header, 3476K Other
 36M Compressed, 89M Uncompressed, 2.46:1 Ratio
Swap: 10G Total, 10G Free

Thanks for the suggestion.

Roger



Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Roger Mason


Ole Holm Nielsen  writes:

> 1. Is slurmd running on the node?
Yes.

> 2. What's the output of "slurmd -C" on the node?
NodeName=node012 CPUs=4 Boards=1 SocketsPerBoard=2 CoresPerSocket=2
ThreadsPerCore=1 RealMemory=6097

> 3. Define State=UP in slurm.conf instead of UNKNOWN
Will do.

> 4. Why have you configured TmpDisk=0?  It should be the size of the
> /tmp filesystem.
I have not configured TmpDisk.  This is the entry in slurm.conf for that
node:
NodeName=node012 CPUs=4 Boards=1 SocketsPerBoard=2 CoresPerSocket=2
ThreadsPerCore=1 RealMemory=10193  State=UNKNOWN

But I do notice that slurmd -C now says there is less memory than
configured.

Thanks again.

Roger



Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Brian Andrus

That output of slurmd -C is your answer.

Slurmd only sees 6GB of memory and you are claiming it has 10GB.

I would run some memtests, look at meminfo on the node, etc.

Maybe even check that the type/size of memory in there is what you think 
it is.
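
On a FreeBSD node the rough equivalents might be something like this
(dmidecode is an assumption here, installed from ports/pkg):

sysctl hw.physmem hw.realmem
dmidecode -t memory | grep -i size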


Brian Andrus

On 5/25/2023 7:30 AM, Roger Mason wrote:

Ole Holm Nielsen  writes:


1. Is slurmd running on the node?

Yes.


2. What's the output of "slurmd -C" on the node?

NodeName=node012 CPUs=4 Boards=1 SocketsPerBoard=2 CoresPerSocket=2
ThreadsPerCore=1 RealMemory=6097


3. Define State=UP in slurm.conf instead of UNKNOWN

Will do.


4. Why have you configured TmpDisk=0?  It should be the size of the
/tmp filesystem.

I have not configured TmpDisk.  This is the entry in slurm.conf for that
node:
NodeName=node012 CPUs=4 Boards=1 SocketsPerBoard=2 CoresPerSocket=2
ThreadsPerCore=1 RealMemory=10193  State=UNKNOWN

But I do notice that slurmd -C now says there is less memory than
configured.

Thanks again.

Roger





Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Groner, Rob
A quick test to see if it's a configuration error is to set config_overrides in 
your slurm.conf and see if the node then responds to scontrol update.
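
If I remember the 20.02-era option name correctly, that would be something
like the line below (treat it as an assumption and check the slurm.conf man
page for your version), followed by a reconfigure/restart of the daemons and
another try at state=resume:

SlurmdParameters=config_overrides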


From: slurm-users  on behalf of Brian 
Andrus 
Sent: Thursday, May 25, 2023 10:54 AM
To: slurm-users@lists.schedmd.com 
Subject: Re: [slurm-users] Nodes stuck in drain state

That output of slurmd -C is your answer.

Slurmd only sees 6GB of memory and you are claiming it has 10GB.

I would run some memtests, look at meminfo on the node, etc.

Maybe even check that the type/size of memory in there is what you think
it is.

Brian Andrus

On 5/25/2023 7:30 AM, Roger Mason wrote:
> Ole Holm Nielsen  writes:
>
>> 1. Is slurmd running on the node?
> Yes.
>
>> 2. What's the output of "slurmd -C" on the node?
> NodeName=node012 CPUs=4 Boards=1 SocketsPerBoard=2 CoresPerSocket=2
> ThreadsPerCore=1 RealMemory=6097
>
>> 3. Define State=UP in slurm.conf instead of UNKNOWN
> Will do.
>
>> 4. Why have you configured TmpDisk=0?  It should be the size of the
>> /tmp filesystem.
> I have not configured TmpDisk.  This is the entry in slurm.conf for that
> node:
> NodeName=node012 CPUs=4 Boards=1 SocketsPerBoard=2 CoresPerSocket=2
> ThreadsPerCore=1 RealMemory=10193  State=UNKNOWN
>
> But I do notice that slurmd -C now says there is less memory than
> configured.
>
> Thanks again.
>
> Roger
>



Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Roger Mason
Hello,

"Groner, Rob"  writes:

> A quick test to see if it's a configuration error is to set
> config_overrides in your slurm.conf and see if the node then responds
> to scontrol update.

Thanks to all who helped.  It turned out that memory was the issue.  I
have now reseated the RAM in the offending node and all seems well.

I have another node also stuck in drain that I will investigate.  I
picked up some useful tips from the replies, but if I can't get it back
on-line I hope the friendly people on this list will rescue me.

Thanks again,
Roger



[slurm-users] Nodes stuck in drain state and sending Invalid Argument every second

2020-02-06 Thread Dean Schulze
I moved two nodes to another controller and the two nodes will not come out
of the drain state now.  I've rebooted the hosts but they are still stuck
in the drain state.  There is nothing in the location given for saving
state so I can't understand why a reboot doesn't clear this.
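
For what it's worth, the directory the controller actually uses for saved
state can be confirmed with something like:

scontrol show config | grep -i StateSaveLocation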

Here's the node state:

$ scontrol show node slurmnode1
NodeName=slurmnode1 Arch=x86_64 CoresPerSocket=8
   CPUAlloc=0 CPUTot=16 CPULoad=0.58
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:gp100:4
   NodeAddr=slurmnode1 NodeHostName=slurmnode1 Version=19.05.4
   OS=Linux 5.3.0-28-generic #30~18.04.1-Ubuntu SMP Fri Jan 17 06:14:09 UTC
2020
   RealMemory=47671 AllocMem=0 FreeMem=46385 Sockets=1 Boards=1
   State=DOWN*+DRAIN ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A
MCS_label=N/A
   Partitions=debug
   BootTime=2020-02-06T13:48:25 SlurmdStartTime=2020-02-06T13:48:31
   CfgTRES=cpu=16,mem=47671M,billing=16
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=none [dean@2020-02-06T13:38:13]


The nodes are also sending the controller an error nearly every second
while the slurmds are running:

error: _slurm_rpc_node_registration node=slurmnode2: Invalid argument

I did have to open up the Slurm ports on the network after moving these two
nodes to the new controller, since the nodes are wired while the controller
is wireless, but there seems to be two-way communication.

Any ideas what the problem is?