[slurm-users] Intermittent "Not responding" status

2017-12-04 Thread Stradling, Alden Reid (ars9ac)
I have a number of nodes that have, after our transition to Centos 7.3/SLURM 17.02, begun to occasionally display a status of "Not responding". The health check we run on each node every 5 minutes detects nothing, and the nodes are perfectly healthy once I set their state to "idle". The slurmd c

Re: [slurm-users] Intermittent "Not responding" status

2017-12-04 Thread Paul Edmon
I've seen this happen when there are internode communication issues that disrupt the tree that Slurm uses to talk to the nodes and do heartbeats. We have this happen occasionally in our environment, as we have nodes that are in two geographically separate facilities and the latency is substantia

Re: [slurm-users] Intermittent "Not responding" status

2017-12-04 Thread Chris Samuel
On Tuesday, 5 December 2017 5:57:59 AM AEDT Stradling, Alden Reid (ars9ac) wrote:
> I have a number of nodes that have, after our transition to Centos 7.3/SLURM
> 17.02, begun to occasionally display a status of "Not responding".
I'd suggest checking in your slurmd and slurmctld logs to see if a
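
For anyone trying to catch the symptom discussed in this thread as it happens, here is a minimal sketch in Python, not something posted by the participants. It relies on sinfo marking a non-responding node by appending "*" to its state (for example "idle*"). The function names non_responding_nodes and query_sinfo are hypothetical, and the sample node names are made up for illustration.

```python
import subprocess


def non_responding_nodes(sinfo_output):
    """Return node names whose state ends in '*', sinfo's not-responding marker.

    Expects text in the form produced by: sinfo -N -h -o "%N %T"
    (one "name state" pair per line).
    """
    nodes = []
    for line in sinfo_output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        name, state = parts
        # With -N a node is listed once per partition, so de-duplicate.
        if state.endswith("*") and name not in nodes:
            nodes.append(name)
    return nodes


def query_sinfo():
    """Run sinfo (assumes it is on PATH) and return its raw text output."""
    result = subprocess.run(
        ["sinfo", "-N", "-h", "-o", "%N %T"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Made-up sample so the sketch runs without a cluster;
    # on a real system, pass query_sinfo() instead.
    sample = "node01 idle\nnode02 idle*\nnode03 down*\nnode04 allocated\n"
    print(non_responding_nodes(sample))
```

Running something like this from cron alongside the existing health check and logging the result with a timestamp would give times to line up against the slurmd and slurmctld logs.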