Re: [slurm-users] [EXTERNAL] Re: trying to diagnose a connectivity issue between the slurmctld process and the slurmd nodes

2020-11-30 Thread Steve Bland
Thanks Diego. Actually, nothing at all in the hosts file; it did not seem to need modifying to see the nodes. The different case on one of the nodes was an experiment to see if the names were in fact case-sensitive, but all networking functions between the nodes, with say munge, seem to work

Re: [slurm-users] [EXTERNAL] Re: trying to diagnose a connectivity issue between the slurmctld process and the slurmd nodes

2020-11-30 Thread Steve Bland
Thanks Chris. When I did that, they all came back. I also found that in slurm.conf, ReturnToService was set to 0, so I modified that for now. I may turn it back to 0 to see if any nodes are lost, but I assume that will show up in the log. Interestingly, I had this in slurm.conf and thought that would make the
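
For reference, a minimal sketch of the slurm.conf line being discussed (the rest of the file is not shown in the thread):

    # 0 = a node set DOWN stays DOWN until an administrator returns it to service
    # 1 = a node that was DOWN only because it was non-responsive returns to
    #     service once slurmd registers with a valid configuration
    ReturnToService=1

The change needs to reach slurm.conf on all hosts and be picked up with scontrol reconfigure or a daemon restart.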

Re: [slurm-users] [EXTERNAL] Re: trying to diagnose a connectivity issue between the slurmctld process and the slurmd nodes

2020-11-30 Thread Steve Bland
Although, in testing, even with ReturnToService set to '1', on a restart the system sees in the logs that the node has come back, but it is still classified as down, so it will not take jobs until manually told otherwise. [2020-11-30T10:33:05.402] debug2: node_did_resp SRVGRIDSLURM01 [2020-11-30T10:33:05
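
For reference, the usual way to manually clear that down state (using the node name from the log line above) is:

    # tell slurmctld the node is healthy again so it can accept jobs
    scontrol update NodeName=SRVGRIDSLURM01 State=RESUME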

Re: [slurm-users] [EXT] Re: [EXTERNAL] Re: trying to diagnose a connectivity issue between the slurmctld process and the slurmd nodes

2020-11-30 Thread Sean Crosby
You showed that firewalld is off, but on CentOS 7/RHEL 7 that doesn't really prove there is no firewall. What is the output of iptables -S? I'd also try doing # scontrol show config | grep -i SlurmdPort SlurmdPort = 6818 And whatever port is shown, from the compute nodes, try co
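
A sketch of the checks being suggested, assuming SRVGRIDSLURM01 is the controller and SRVGRIDSLURM02 is a compute node (substitute your own hostnames and whatever ports scontrol reports):

    # firewalld being off does not guarantee an empty ruleset on CentOS 7 / RHEL 7
    iptables -S

    # port slurmd listens on (default 6818); slurmctld must be able to reach it
    scontrol show config | grep -i SlurmdPort
    nc -vz SRVGRIDSLURM02 6818      # run from the controller

    # port slurmctld listens on (default 6817); compute nodes must be able to reach it
    scontrol show config | grep -i SlurmctldPort
    nc -vz SRVGRIDSLURM01 6817      # run from a compute node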

Re: [slurm-users] [EXTERNAL] Re: trying to diagnose a connectivity issue between the slurmctld process and the slurmd nodes

2020-11-30 Thread mercan
Hi; did you test the munge connection? If not, would you test it like this: munge -n | ssh SRVGRIDSLURM02 unmunge Ahmet M. On 30.11.2020 14:43, Steve Bland wrote: Thanks Diego. Actually, nothing at all in the hosts file, did not seem to need to modify it to see the nodes. The differe
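
Spelled out, the round-trip test being suggested (run it in both directions so each host's key and clock get checked):

    # encode a credential on this host and decode it on the remote one
    munge -n | ssh SRVGRIDSLURM02 unmunge     # controller -> compute node
    munge -n | ssh SRVGRIDSLURM01 unmunge     # compute node -> controller

unmunge should report STATUS: Success (0); anything else usually points at mismatched munge keys or clock skew.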

[slurm-users] Kill task failed, state set to DRAINING, UnkillableStepTimeout=120

2020-11-30 Thread Robert Kudyba
I've seen where this was a bug that was fixed (https://bugs.schedmd.com/show_bug.cgi?id=3941), but it still happens occasionally. A user cancels his/her job and a node gets drained. UnkillableStepTimeout=120 is set in slurm.conf. Slurm 20.02.3 on CentOS 7.9 running on Bright Cluster 8.2. Slurm Job_i
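
For reference, the timeout mentioned is a plain slurm.conf parameter; a minimal sketch with the value from this message (not a recommendation) is:

    # seconds to wait, after SIGKILL, before declaring a job step unkillable
    # and draining the node
    UnkillableStepTimeout=120

A node drained this way can be returned to service with scontrol update NodeName=<node> State=RESUME once the stuck processes are gone.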

Re: [slurm-users] Kill task failed, state set to DRAINING, UnkillableStepTimeout=120

2020-11-30 Thread Paul Edmon
That can help. Usually this happens because the laggy storage the job is using takes time flushing the job's data, so making sure that your storage is up, responsive, and stable will also cut these down. -Paul Edmon- On 11/30/2020 12:52 PM, Robert Kudyba wrote: I've seen where this was a bug tha

Re: [slurm-users] Kill task failed, state set to DRAINING, UnkillableStepTimeout=120

2020-11-30 Thread Robert Kudyba
Sure, I've seen that in some of the posts here, e.g., a NAS. But in this case it's an NFS share to the local RAID10 storage. Aren't there any other settings that deal with this, so as not to drain a node? On Mon, Nov 30, 2020 at 1:02 PM Paul Edmon wrote: > That can help. Usually this happens due to lagg

Re: [slurm-users] Kill task failed, state set to DRAINING, UnkillableStepTimeout=120

2020-11-30 Thread Alex Chekholko
This may be more "cargo cult", but I've advised users to add a "sleep 60" to the end of their job scripts if they are "I/O intensive". Sometimes they are somehow able to generate I/O in a way that Slurm thinks the job is finished, but the OS is still catching up on the I/O, and then Slurm tries to
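
A minimal sketch of what that workaround looks like in a batch script; the workload line and options here are hypothetical:

    #!/bin/bash
    #SBATCH --job-name=io_heavy       # hypothetical job name
    #SBATCH --time=01:00:00

    ./run_io_heavy_workload           # hypothetical I/O-intensive step
    sleep 60                          # give the filesystem time to settle before Slurm tears the job down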

[slurm-users] Slurm SC20 Birds-of-a-Feather presentation online

2020-11-30 Thread Tim Wickberg
The roadmap presentation from the SC20 Birds-of-a-Feather session is online now: https://slurm.schedmd.com/SC20/BoF.pdf There is also a recording of the BoF, including the Q+A session with Tim and Danny, which will remain available through the SC20 virtual platform for the next few months. Pleas