Re: [slurm-users] Set a random offset when starting node health check in SLURM

2020-11-27 Thread Bjørn-Helge Mevik
You can also check out

HealthCheckNodeState=CYCLE

man slurm.conf:

"Rather than running the health check program on all nodes at the same
time, cycle through running on all compute nodes through the course of
the HealthCheckInterval. May be combined with the various node state
options."

-- 
Cheers,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo




Re: [slurm-users] [EXTERNAL] Re: trying to diagnose a connectivity issue between the slurmctld process and the slurmd nodes

2020-11-27 Thread Steve Bland

Andy

I appreciate you making me check again; things do get missed.

SELinux is off, firewalld is disabled


[root@SRVGRIDSLURM01 ~]# sestatus

SELinux status: disabled

[root@SRVGRIDSLURM01 ~]# systemctl status firewalld

● firewalld.service - firewalld - dynamic firewall daemon

   Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; vendor preset: enabled)

   Active: inactive (dead)

 Docs: man:firewalld(1)

The one thing I can think of is that the system running slurmctld has two 
network interfaces. It serves as a gateway, so it has two network addresses. Two 
of the test slurmd nodes are on the other side of that gateway box; one is on the 
same box. The two on the other side of the gateway have a different IP 
address range and possibly a different mask.

This is from slurm.conf for the three nodes. I know they are talking; I can see 
it in the logs when the logging level is set to debug.
The NodeName info comes from slurmd -C, so that is correct.
I added the IP addresses, but that did not matter.


# COMPUTE NODES

NodeName=SRVGRIDSLURM01 NodeAddr=192.168.1.60 CPUs=4 Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=7821

NodeName=SRVGRIDSLURM02 NodeAddr=192.168.1.61 CPUs=4 Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=7821

NodeName=srvgridslurm03 NodeAddr=192.168.1.62 CPUs=4 Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=7821

PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP

About the only thing I can think of is to make one of the nodes on the 
other side of the gateway into the control node.
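
One quick way to rule out routing or port problems across the gateway is to test the Slurm ports directly, assuming the default ports (6817 for slurmctld, 6818 for slurmd) and that nc is available:

# from the controller, check each compute node's slurmd port
nc -zv 192.168.1.61 6818
nc -zv 192.168.1.62 6818

# from each compute node, check the controller's slurmctld port
nc -zv 192.168.1.60 6817

# from a compute node, confirm slurmctld answers Slurm-level requests
scontrol ping

If the controller has to reach the nodes behind the gateway through NAT or a different route, the NodeAddr values above may not be reachable from the controller's side, which would be consistent with the "not responding" state.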



Steve Bland
Technical Product Manager

Third Party Products
Ross Video | Production Technology Experts
T: +1 (613) 228-0688 ext.4219
www.rossvideo.com


From: Andy Riebs  on behalf of Andy Riebs 

Sent: 26 November 2020 13:40
To: Steve Bland ; Slurm User Community List 

Subject: Re: [EXTERNAL] Re: [slurm-users] trying to diagnose a connectivity 
issue between the slurmctld process and the slurmd nodes


One last shot on the firewall front, Steve -- does the control node have a 
firewall enabled? I've seen cases where that can cause the sporadic messaging 
failures that you seem to be seeing.

That failing, I'll defer to anyone with different ideas!

Andy

On 11/26/2020 1:01 PM, Steve Bland wrote:


Re: [slurm-users] [EXTERNAL] Re: trying to diagnose a connectivity issue between the slurmctld process and the slurmd nodes

2020-11-27 Thread Andy Riebs
Steve, you've exhausted my best ideas... hope someone else can jump in!

Andy



Re: [slurm-users] trying to diagnose a connectivity issue between the slurmctld process and the slurmd nodes

2020-11-27 Thread Chris Samuel

On 26/11/20 9:21 am, Steve Bland wrote:


Sinfo always returns nodes not responding


One thing - do the nodes return to this state when you resume them with 
"scontrol update node=srvgridslurm[01-03] state=resume" ?


If they do, then what do your slurmctld logs say is the reason for this?

You can bump up the log level on your slurmctld for more info, for instance 
with "scontrol setdebug debug" (we run ours at debug all the 
time anyway).
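
A minimal sequence along those lines, using the node names from the config quoted earlier (the log path is a guess; check SlurmctldLogFile in your slurm.conf):

scontrol update nodename=srvgridslurm[01-03] state=resume
scontrol setdebug debug
tail -f /var/log/slurm/slurmctld.log
scontrol setdebug info      # drop the level back down when done

Whatever reason slurmctld records for marking the nodes not responding again should appear in that log shortly after the resume.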


All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA