[slurm-users] node health check

2023-01-30 Thread Ratnasamy, Fritz
Hi,

Currently, some of our nodes are overloaded. The installed NHC used to
check the load and drain a node when it was overloaded. However, for the
past few days, it has not been reporting the state of the node. When I run
/usr/sbin/nhc manually, it says:

20230130 21:25:14 [slurm] /usr/libexec/nhc/node-mark-online mcn26.chicagobooth.edu
/usr/libexec/nhc/node-mark-online:  Not sure how to handle node state "" on mcn26.chicagobooth.edu
/usr/libexec/nhc/node-mark-online:  Skipping  node mcn26.chicagobooth.edu ( )

It seems that it is not able to read the state of the node. I ran
"scontrol show node mcn26" and got:

NodeName=mcn26 Arch=x86_64 CoresPerSocket=16
   NodeAddr=mcn26 NodeHostName=mcn26 Version=20.11.8

Any idea what happened and why nhc is not reading the state of the node
anymore?
Best,

Fritz Ratnasamy
Data Scientist
Information Technology

Re: [slurm-users] node health check

2023-01-30 Thread Ole Holm Nielsen

On 1/31/23 04:35, Ratnasamy, Fritz wrote:
> Currently, some of our nodes are overloaded. The nhc installed used to
> check the load and drain the node when it is overloaded. However, for the
> past few days, it is not showing the state of the node. When I run
> /usr/sbin/nhc manually, it says
> 20230130 21:25:14 [slurm] /usr/libexec/nhc/node-mark-online mcn26.chicagobooth.edu
> /usr/libexec/nhc/node-mark-online:  Not sure how to handle node state "" on mcn26.chicagobooth.edu
> /usr/libexec/nhc/node-mark-online:  Skipping  node mcn26.chicagobooth.edu ( )
>
> It seems that it is not able to read the state of the node. I ran scontrol
> show node mcn26
>
> NodeName=mcn26 Arch=x86_64 CoresPerSocket=16
>    NodeAddr=mcn26 NodeHostName=mcn26 Version=20.11.8
>
> Any idea what happened and why nhc is not reading the state of the node
> anymore?


What's the complete output of "scontrol show node mcn26", especially the 
State=... information?
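
For reference, the State= field can be pulled out of the scontrol output with standard text tools. The following is only a minimal sketch, not part of NHC or Slurm: node_state is a hypothetical helper name, and it assumes the usual KEY=VALUE layout of "scontrol show node" output.

```shell
#!/bin/bash
# node_state (hypothetical helper): read "scontrol show node" output on stdin
# and print the value of the first field named exactly State=, or nothing
# if no such field is present.
node_state() {
    # Put every whitespace-separated KEY=VALUE pair on its own line,
    # then keep only the value of the State key.
    tr -s ' \t' '\n' | sed -n 's/^State=//p' | head -n 1
}

# Only query Slurm when scontrol is actually available on this machine.
if command -v scontrol >/dev/null 2>&1; then
    scontrol show node mcn26 | node_state
fi
```

An empty result here would match the empty node state ("") that node-mark-online complains about.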


Which version of NHC are you running?

/Ole


Re: [slurm-users] node health check

2023-01-31 Thread Brian Johanson


On 1/30/23 10:35 PM, Ratnasamy, Fritz wrote:
> Hi,
>
> Currently, some of our nodes are overloaded. The nhc installed used
> to check the load and drain the node when it is overloaded. However,
> for the past few days, it is not showing the state of the node. When
> I run /usr/sbin/nhc manually, it says
> 20230130 21:25:14 [slurm] /usr/libexec/nhc/node-mark-online mcn26.chicagobooth.edu
> /usr/libexec/nhc/node-mark-online:  Not sure how to handle node state "" on mcn26.chicagobooth.edu
> /usr/libexec/nhc/node-mark-online:  Skipping  node mcn26.chicagobooth.edu ( )
>
> It seems that it is not able to read the state of the node. I ran
> scontrol show node mcn26
>
> NodeName=mcn26 Arch=x86_64 CoresPerSocket=16
>    NodeAddr=mcn26 NodeHostName=mcn26 Version=20.11.8
>
> Any idea what happened and why nhc is not reading the state of the
> node anymore?



NHC is using the FQDN, but Slurm isn't (NodeHostName=mcn26), so the query
is failing.

We have a line 'export HOSTNAME=$(hostname -s)' in /etc/sysconfig/nhc,
which makes NHC use the short hostname instead.
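
The name mismatch can be checked by hand on the node. This is a rough sketch only, assuming NHC takes the node name from the HOSTNAME variable (as the setting above suggests); short_name is a hypothetical helper and not part of NHC or Slurm.

```shell
#!/bin/bash
# short_name (hypothetical helper): strip the domain part from a host name,
# turning the FQDN that NHC reported into the short name Slurm uses.
short_name() {
    printf '%s\n' "${1%%.*}"
}

fqdn="mcn26.chicagobooth.edu"      # name taken from the NHC log in this thread
short_name "$fqdn"                 # prints: mcn26

# If Slurm is installed, compare what it returns for each form of the name.
if command -v scontrol >/dev/null 2>&1; then
    scontrol show node "$fqdn" || true
    scontrol show node "$(short_name "$fqdn")" || true
fi
```

If only the short form returns a full node record with a State= field, the FQDN is the likely cause of the empty node state.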


-b