Re: [slurm-users] trying to diagnose a connectivity issue between the slurmctld process and the slurmd nodes

2020-11-27 Thread Chris Samuel

On 26/11/20 9:21 am, Steve Bland wrote:


Sinfo always returns nodes not responding


One thing - do the nodes return to this state when you resume them with 
"scontrol update node=srvgridslurm[01-03] state=resume" ?


If they do, what do your slurmctld logs give as the reason?

You can bump up the log level on your slurmctld with, for instance, 
"scontrol setdebug debug" for more info (we run ours at debug all the 
time anyway).
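
The resume command above uses a hostlist expression, "srvgridslurm[01-03]". As a rough illustration of which node names that covers, here is a minimal Python sketch. The function name expand_hostlist is made up for this example and it only handles a single bracket group; on a real system "scontrol show hostnames" does the expansion properly.

```python
import re


def expand_hostlist(expr):
    """Expand a simple hostlist such as 'node[01-03,07]' into node names.

    Hypothetical helper: handles one bracket group only, which is far less
    than what Slurm's own hostlist syntax allows.
    """
    match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", expr)
    if match is None:
        return [expr]  # no bracket group: treat it as a single node name
    prefix, body, suffix = match.groups()
    names = []
    for part in body.split(","):
        low, _, high = part.partition("-")
        high = high or low          # a lone number is a range of one
        width = len(low)            # keep zero padding, e.g. "01" -> 2 digits
        for n in range(int(low), int(high) + 1):
            names.append(f"{prefix}{n:0{width}d}{suffix}")
    return names


if __name__ == "__main__":
    for name in expand_hostlist("srvgridslurm[01-03]"):
        print(name)
```

Note that the expanded names are all lower case, while two of the nodes in the original report are listed in upper case; whether that matters on a given setup is something to check rather than assume.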


All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA



Re: [slurm-users] trying to diagnose a connectivity issue between the slurmctld process and the slurmd nodes

2020-11-26 Thread Andy Riebs

1. Look for a firewall on all of your Slurm nodes; firewalls almost
   always break Slurm communications.
2. Confirm that "ssh srvgridslurm01 hostname" returns, exactly,
   "srvgridslurm01"
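
Check 2 above can be sketched in Python as follows. This is only an illustration: the function names reported_hostname and find_mismatches are invented here, passwordless ssh to each node is assumed, and the comparison is deliberately exact, so a difference in case or an appended domain name counts as a mismatch.

```python
import subprocess


def reported_hostname(node, timeout=10):
    """Run `ssh <node> hostname` and return the trimmed output.

    Assumes key-based ssh access to the node is already set up.
    """
    result = subprocess.run(
        ["ssh", node, "hostname"],
        capture_output=True, text=True, timeout=timeout, check=True,
    )
    return result.stdout.strip()


def find_mismatches(reported):
    """Given {configured_name: reported_hostname}, return entries that differ.

    The comparison is exact and case-sensitive on purpose.
    """
    return {name: host for name, host in reported.items() if name != host}


if __name__ == "__main__":
    # Node names as used in the resume command earlier in the thread.
    nodes = ["srvgridslurm01", "srvgridslurm02", "srvgridslurm03"]
    reported = {node: reported_hostname(node) for node in nodes}
    for name, host in find_mismatches(reported).items():
        print(f"{name}: hostname reports {host!r}")
```

An empty result means every node reports exactly the name it was addressed by.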

Andy


[slurm-users] trying to diagnose a connectivity issue between the slurmctld process and the slurmd nodes

2020-11-26 Thread Steve Bland

sinfo always reports the nodes as not responding:
[root@srvgridslurm03 ~]# sinfo -R
REASON           USER   TIMESTAMP            NODELIST
Not responding   slurm  2020-11-26T09:12:58  SRVGRIDSLURM01
Not responding   slurm  2020-11-26T08:27:58  SRVGRIDSLURM02
Not responding   slurm  2020-11-26T10:00:14  srvgridslurm03
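
For anyone who wants to work with this output in a script, here is a minimal Python sketch that turns "sinfo -R" text into records. The function name parse_sinfo_reasons is invented for this example, and it assumes the default four-column layout shown above, where only the REASON column can contain spaces.

```python
def parse_sinfo_reasons(text):
    """Parse default `sinfo -R` output into a list of dicts.

    Hypothetical helper: splits each line from the right, because the
    REASON column may contain spaces while the other three do not.
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("REASON"):
            continue  # skip blank lines and the header
        parts = line.rsplit(None, 3)
        if len(parts) != 4:
            continue  # not a data row in the expected layout
        reason, user, timestamp, nodelist = parts
        rows.append({
            "reason": reason.strip(),
            "user": user,
            "timestamp": timestamp,
            "nodelist": nodelist,
        })
    return rows


if __name__ == "__main__":
    sample = """REASON   USER  TIMESTAMP   NODELIST
Not responding   slurm 2020-11-26T09:12:58 SRVGRIDSLURM01
Not responding   slurm 2020-11-26T08:27:58 SRVGRIDSLURM02
Not responding   slurm 2020-11-26T10:00:14 srvgridslurm03
"""
    for row in parse_sinfo_reasons(sample):
        print(row["nodelist"], row["timestamp"], row["reason"])
```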


By tailing the slurmctld log, I can see when a node is recognized:
Node srvgridslurm03 now responding


By turning up the logging level I can see communication between slurmctld and 
the nodes, and there appears to be a response:

[2020-11-26T12:05:14.333] debug3: Tree sending to SRVGRIDSLURM01
[2020-11-26T12:05:14.333] debug2: Tree head got back 0 looking for 3
[2020-11-26T12:05:14.333] debug3: Tree sending to SRVGRIDSLURM02
[2020-11-26T12:05:14.333] debug3: Tree sending to srvgridslurm03
[2020-11-26T12:05:14.335] debug2: Tree head got back 1
[2020-11-26T12:05:14.335] debug2: Tree head got back 2
[2020-11-26T12:05:14.336] debug2: Tree head got back 3
[2020-11-26T12:05:14.338] debug2: node_did_resp SRVGRIDSLURM01
[2020-11-26T12:05:14.338] debug2: node_did_resp SRVGRIDSLURM02
[2020-11-26T12:05:14.338] debug2: node_did_resp srvgridslurm03
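
To make the comparison between "responded" and "marked down" easier to repeat, here is a small Python sketch that pulls the node_did_resp entries out of log lines in the format shown above. The name responding_nodes is invented for this example, and the pattern is based only on these sample lines, not on any documented log format.

```python
import re

# Matches lines such as:
# [2020-11-26T12:05:14.338] debug2: node_did_resp SRVGRIDSLURM01
RESP = re.compile(
    r"^\[(?P<time>[^\]]+)\]\s+\w+:\s+node_did_resp\s+(?P<node>\S+)"
)


def responding_nodes(log_lines):
    """Map each node name to the timestamp of its latest node_did_resp line."""
    seen = {}
    for line in log_lines:
        match = RESP.match(line.strip())
        if match:
            seen[match.group("node")] = match.group("time")
    return seen


if __name__ == "__main__":
    sample = [
        "[2020-11-26T12:05:14.333] debug3: Tree sending to SRVGRIDSLURM01",
        "[2020-11-26T12:05:14.336] debug2: Tree head got back 3",
        "[2020-11-26T12:05:14.338] debug2: node_did_resp SRVGRIDSLURM01",
        "[2020-11-26T12:05:14.338] debug2: node_did_resp srvgridslurm03",
    ]
    for node, when in responding_nodes(sample).items():
        print(node, when)
```

Any node that shows up here and is also listed by "sinfo -R" as not responding is a case of the mismatch described in this message.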

What I do not understand is the disconnect. slurmctld seems to record responses, 
but it flags every node as not responding. There are only three nodes right now 
as this is a test environment: three CentOS 7 systems.

[root@SRVGRIDSLURM01 ~]# scontrol show node
NodeName=SRVGRIDSLURM01 Arch=x86_64 CoresPerSocket=4
   CPUAlloc=0 CPUTot=4 CPULoad=0.01
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=SRVGRIDSLURM01 NodeHostName=SRVGRIDSLURM01 Version=20.11.0
   OS=Linux 3.10.0-1160.2.2.el7.x86_64 #1 SMP Tue Oct 20 16:53:08 UTC 2020
   RealMemory=7821 AllocMem=0 FreeMem=5211 Sockets=1 Boards=1
   State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=debug
   BootTime=2020-11-24T08:04:25 SlurmdStartTime=2020-11-26T11:38:25
   CfgTRES=cpu=4,mem=7821M,billing=4
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Not responding [slurm@2020-11-26T09:12:58]
   Comment=(null)

NodeName=SRVGRIDSLURM02 Arch=x86_64 CoresPerSocket=4
   CPUAlloc=0 CPUTot=4 CPULoad=0.01
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=SRVGRIDSLURM02 NodeHostName=SRVGRIDSLURM02 Version=20.11.0
   OS=Linux 3.10.0-1160.2.2.el7.x86_64 #1 SMP Tue Oct 20 16:53:08 UTC 2020
   RealMemory=7821 AllocMem=0 FreeMem=6900 Sockets=1 Boards=1
   State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=debug
   BootTime=2020-11-24T08:04:32 SlurmdStartTime=2020-11-26T10:31:08
   CfgTRES=cpu=4,mem=7821M,billing=4
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Not responding [slurm@2020-11-26T08:27:58]
   Comment=(null)

NodeName=srvgridslurm03 Arch=x86_64 CoresPerSocket=4
   CPUAlloc=0 CPUTot=4 CPULoad=0.01
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=srvgridslurm03 NodeHostName=srvgridslurm03 Version=20.11.0
   OS=Linux 3.10.0-1160.2.2.el7.x86_64 #1 SMP Tue Oct 20 16:53:08 UTC 2020
   RealMemory=7821 AllocMem=0 FreeMem=7170 Sockets=1 Boards=1
   State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=debug
   BootTime=2020-11-26T09:46:49 SlurmdStartTime=2020-11-26T11:55:23
   CfgTRES=cpu=4,mem=7821M,billing=4
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Not responding [slurm@2020-11-26T10:00:14]
   Comment=(null)
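
The listing above is long, so here is a minimal Python sketch that reduces "scontrol show node" text to name, state and reason per node. The function name summarize_nodes is invented for this example, and it assumes the multi-line layout shown above, with State= and Reason= each starting a line.

```python
import re


def summarize_nodes(text):
    """Extract NodeName, State and Reason from `scontrol show node` output.

    Hypothetical helper: relies on NodeName=, State= and Reason= each
    appearing at the start of a (stripped) line, as in the listing above.
    """
    nodes = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        match = re.match(r"NodeName=(\S+)", line)
        if match:
            current = {"name": match.group(1), "state": None, "reason": None}
            nodes.append(current)
            continue
        if current is None:
            continue  # ignore anything before the first NodeName= line
        match = re.match(r"State=(\S+)", line)
        if match:
            current["state"] = match.group(1)
        match = re.match(r"Reason=(.*)", line)
        if match:
            current["reason"] = match.group(1)
    return nodes


if __name__ == "__main__":
    sample = """NodeName=srvgridslurm03 Arch=x86_64 CoresPerSocket=4
   State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Reason=Not responding [slurm@2020-11-26T10:00:14]
"""
    for node in summarize_nodes(sample):
        print(node["name"], node["state"], node["reason"])
```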

Any suggestions? Thanks


--

This e-mail and any attachments may contain information that is confidential to 
Ross Video.

If you are not the intended recipient, please notify me immediately by replying 
to this message. Please also delete all copies. Thank you.