Re: [slurm-users] [External] Re: Draining hosts because of failing jobs

2021-05-04 Thread Prentice Bisbal
I haven't thought about it too hard, but the default NHC scripts do not notice that. That's the problem with NHC and any other problem-checking script: You have to tell them what errors to check for. As new errors occur, those scripts inevitably grow longer. -- Prentice On 5/4/21 12:47
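
As an illustration of how such checks accumulate, an nhc.conf for LBNL NHC typically grows one line per failure mode you've been bitten by. A minimal sketch, with thresholds invented for the example:

  # /etc/nhc/nhc.conf - one check per known failure mode
  * || check_fs_mount_rw -f /
  * || check_fs_free /tmp 3%
  * || check_ps_service -u root -S sshd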

Re: [slurm-users] [External] Re: Questions about adding new nodes to Slurm

2021-05-04 Thread Prentice Bisbal
I agree that people are making updating slurm.conf out to be a bigger issue than it is. However, there are certain config changes that do require restarting the daemons rather than just doing 'scontrol reconfigure'. These options are documented in the slurm.conf documentation

[slurm-users] Replacement for diamond

2021-05-04 Thread Paul Edmon
Python diamond has historically been really useful for shipping data to graphite.  We have a bunch of diamond collectors we wrote for slurm as a result: https://github.com/fasrc/slurm-diamond-collector  However, with Python 2 being end-of-life and diamond being unavailable for Python 3, we need
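
For anyone replacing diamond outright: Graphite's plaintext protocol is simple enough to feed directly. A minimal sketch, assuming a Graphite listener at graphite.example.com:2003 (hostname and metric path are made up for the example):

  #!/bin/bash
  # ship one Slurm metric to Graphite's plaintext listener
  pending=$(squeue -h -t PENDING | wc -l)
  echo "slurm.jobs.pending ${pending} $(date +%s)" | nc -w 1 graphite.example.com 2003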

Re: [slurm-users] Draining hosts because of failing jobs

2021-05-04 Thread Alex Chekholko
In my most recent experience, I have some SSDs in compute nodes that occasionally just drop off the bus, so the compute node loses its OS disk. I haven't thought about it too hard, but the default NHC scripts do not notice that. Similarly, Paul's proposed script might need to also check that the
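
A crude probe for that particular failure is simply proving that a path on the OS disk is still writable. A rough sketch (not an actual NHC check; the probe path and drain reason are placeholders):

  #!/bin/bash
  # drain this node if the OS disk is no longer writable,
  # e.g. because the SSD dropped off the bus
  # (note /tmp may be tmpfs, so probe somewhere under /var instead)
  probe=/var/tmp/.health_probe
  if ! ( touch "$probe" && rm -f "$probe" ); then
      scontrol update nodename="$(hostname -s)" state=drain reason="OS disk not writable"
      exit 1
  fi
  exit 0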

Re: [slurm-users] Draining hosts because of failing jobs

2021-05-04 Thread Paul Edmon
Since you can run an arbitrary script as a node health checker, I might add a script that counts failures and then closes the node if it hits a threshold.  The script shouldn't need to talk to the slurmctld or slurmdbd, as it should be able to watch the log on the node and see the fail. -Paul Edmon- On
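
As an illustration of that idea (not Paul's actual script), a rough sketch that could run from cron or as the HealthCheckProgram; the log path, match pattern, and threshold are all site-specific guesses you would tune:

  #!/bin/bash
  # drain this node once recent job failures in the local slurmd log pass a threshold
  LOG=/var/log/slurmd.log   # your SlurmdLogFile
  THRESHOLD=10              # e.g. "10 consecutive failed jobs"
  PATTERN='error:'          # whatever your failures actually log
  fails=$(tail -n 500 "$LOG" | grep -c "$PATTERN")
  if [ "$fails" -ge "$THRESHOLD" ]; then
      scontrol update nodename="$(hostname -s)" state=drain \
          reason="health check: ${fails} recent failures in slurmd log"
  fi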

[slurm-users] Draining hosts because of failing jobs

2021-05-04 Thread Gerhard Strangar
Hello, how do you implement something like "drain host after 10 consecutive failed jobs"? Unlike a host check script that checks for known errors, I'd like to stop killing jobs just because one node is faulty. Gerhard

Re: [slurm-users] Questions about adding new nodes to Slurm

2021-05-04 Thread Ole Holm Nielsen
The task of adding or removing nodes from Slurm is well documented and discussed in SchedMD presentations; please see my Wiki page https://wiki.fysik.dtu.dk/niflheim/SLURM#add-and-remove-nodes /Ole On 04-05-2021 14:47, Tina Friedrich wrote: Not sure if that's changed but aren't there cases
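
For adding nodes specifically, where a plain reconfigure is not sufficient on 20.11 and earlier, the sequence boils down to roughly the following sketch; the clush group names are placeholders, and the wiki page above documents the exact steps and ordering:

  # 1. add the new NodeName/PartitionName lines, then push slurm.conf everywhere
  clush -g compute --copy /etc/slurm/slurm.conf --dest /etc/slurm/
  # 2. restart slurmd on all nodes, then the controller (a reconfigure alone won't do)
  clush -g compute 'systemctl restart slurmd'
  systemctl restart slurmctld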

Re: [slurm-users] Questions about adding new nodes to Slurm

2021-05-04 Thread Tina Friedrich
Not sure if that's changed, but aren't there cases where 'scontrol reconfigure' isn't sufficient? (Like adding nodes?) But yes, that's my point exactly; it is a pretty basic day-to-day task to update slurm.conf, not some daunting operation that requires a downtime or anything like it. (I

Re: [slurm-users] [External] Re: safe to delete old QOSes?

2021-05-04 Thread Tina Friedrich
Rename as in: my default QoS used to be called 'normal' and is now called 'standard' (same settings). (We refer to it as 'standard' in our SLA; it should never have been called 'normal', really.) As I don't think you can actually rename them, what I did was add a copy of the 'normal' QoS (called
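
A sketch of that copy-then-swap with sacctmgr; the limits and the account name here are invented for illustration, not Tina's actual settings:

  # create the replacement QoS and give it the same settings as 'normal'
  sacctmgr add qos standard
  sacctmgr modify qos standard set MaxWall=24:00:00 MaxJobsPerUser=50
  # let the associations use it and make it their default
  sacctmgr modify account root set qos+=standard defaultqos=standard
  # once nothing references 'normal' any more, it can safely go
  sacctmgr delete qos normal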

Re: [slurm-users] Questions about adding new nodes to Slurm

2021-05-04 Thread Sid Young
You can push a new conf file and issue an "scontrol reconfigure" on the fly as needed... I do it on our cluster as needed: do the nodes first, then login nodes, then the Slurm controller... you are making a huge issue of a very basic task... Sid On Tue, 4 May 2021, 22:28 Tina Friedrich wrote:
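
In full, that sequence is roughly the following; the clush group names are placeholders for your own node groups:

  # edit slurm.conf on the controller, then push it to compute and login nodes
  clush -g compute,login --copy /etc/slurm/slurm.conf --dest /etc/slurm/
  # one reconfigure on the controller makes every daemon re-read it
  scontrol reconfigure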

Re: [slurm-users] Questions about adding new nodes to Slurm

2021-05-04 Thread Tina Friedrich
Hello, a lot of people have already given very good answers on how to tackle this. Still, I thought it worth pointing this out - you said 'you need to basically shut down slurm, update the slurm.conf file, then restart'. That makes it sound like a major operation with lots of prep required. It's

Re: [slurm-users] SLURM 20.11.0 no x11 forwarding.

2021-05-04 Thread Tina Friedrich
No idea if I replied to this particular thread already (if I have, sorry for the duplicate). I had issues getting X forwarding to work with SLURM at the start. It worked via SSH, but I got authentication / xauth problems when doing it via SLURM. Turned out to be caused by the nodes having their
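
When debugging this, it helps to compare the two paths side by side; a quick sketch (the node name is a placeholder):

  # X forwarding over plain SSH to the node
  ssh -X node001 xclock
  # the same through Slurm's native forwarding
  srun --x11 xclock
  # if only the second fails, compare the magic cookies each path installs
  xauth list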