No idea if I already replied to this particular thread (if I have,
sorry for the duplicate).
I had issues getting X forwarding to work with SLURM at the start.
It worked via SSH with no problems, but there were authentication / xauth problems when doing it via SLURM.
Turned out to be caused by the nodes having their hostna
Hello,
a lot of people have already given very good answers on how to tackle this.
Still, I thought it worth pointing this out - you said 'you need to
basically shut down slurm, update the slurm.conf file, then restart'.
That makes it sound like a major operation with lots of prep required.
It's no
You can push a new conf file and issue an "scontrol reconfigure" on the fly
as needed... I do it on our cluster whenever required: nodes first, then
login nodes, then the slurm controller... you are making a huge issue of a
very basic task...
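For anyone who hasn't done it before, a minimal sketch of that workflow (the pdcp host groups and file paths below are assumptions; adapt to however you distribute files on your cluster):

#!/bin/bash
# Sketch only: assumes pdcp with "compute" and "login" host groups and that
# the updated slurm.conf already sits on the controller. Use whatever file
# distribution you have (config management, shared filesystem, ...) instead.

# Push the new config out: compute nodes first, then login nodes.
pdcp -g compute /etc/slurm/slurm.conf /etc/slurm/slurm.conf
pdcp -g login   /etc/slurm/slurm.conf /etc/slurm/slurm.conf

# Then have slurmctld (and the slurmd daemons) re-read the configuration.
scontrol reconfigure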
Sid
On Tue, 4 May 2021, 22:28 Tina Friedrich,
wrote:
>
Rename as in my default QoS used to be called 'normal' and is now called
'standard' (same settings). (We refer to it as 'standard' in our SLA; it
should never have been called 'normal', really.)
As I don't think you can actually rename them, what I did was add a copy
of 'normal' QoS (called 'st
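(In case it helps anyone doing the same, a rough sketch of the sacctmgr side of that; names and limits below are examples only, so check your existing settings with 'sacctmgr show qos' first:)

# Inspect the settings on the old QoS first.
sacctmgr show qos normal format=Name,Priority,MaxWall

# Create the new QoS and copy the relevant limits over by hand.
sacctmgr add qos standard
sacctmgr modify qos standard set Priority=0 MaxWall=24:00:00

# Switch users/associations over to the new QoS.
sacctmgr modify user where name=someuser set QOS=standard DefaultQOS=standard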
Not sure if that's changed but aren't there cases where 'scontrol
reconfigure' isn't sufficient? (Like adding nodes?)
But yes, that's my point exactly; it is a pretty basic day-to-day task
to update slurm.conf, not some daunting operation that requires
downtime or anything like it. (I rememb
The task of adding or removing nodes from Slurm is well documented and
discussed in SchedMD presentations; please see my Wiki page:
https://wiki.fysik.dtu.dk/niflheim/SLURM#add-and-remove-nodes
/Ole
On 04-05-2021 14:47, Tina Friedrich wrote:
Not sure if that's changed but aren't there cases wh
Hello,
how do you implement something like "drain host after 10 consecutive
failed jobs"? Unlike a host check script, which checks for known errors,
I'd like to stop jobs being killed just because one node is faulty.
Gerhard
Since you can run an arbitrary script as a node health checker, I might
add a script that counts failures and then closes the node if it hits a
threshold. The script shouldn't need to talk to the slurmctld or
slurmdbd, as it should be able to watch the log on the node and see the failures.
-Paul Edmon-
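A rough sketch of what such a check could look like (the log path, match pattern and threshold are assumptions, and it counts recent rather than strictly consecutive failures):

#!/bin/bash
# Illustrative only: drain this node if too many job failures show up in the
# local slurmd log. Tune LOG, the grep pattern and THRESHOLD for your site.

LOG=/var/log/slurm/slurmd.log
THRESHOLD=10

# Count failure lines among the most recent log entries.
fails=$(tail -n 500 "$LOG" | grep -c -i 'error.*job')

if [ "$fails" -ge "$THRESHOLD" ]; then
    scontrol update NodeName="$(hostname -s)" State=DRAIN \
        Reason="health check: ${fails} recent job failures"
fi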
On
In my most recent experience, I have some SSDs in compute nodes that
occasionally just drop off the bus, so the compute node loses its OS disk.
I haven't thought about it too hard, but the default NHC scripts do not
notice that. Similarly, Paul's proposed script might need to also check
that the s
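(A trivial extra check for that case might look something like the following sketch; the probe file location is just an example:)

#!/bin/bash
# Illustrative sketch: drain the node if the OS filesystem is no longer
# writable, e.g. because the SSD dropped off the bus and the root fs went
# read-only.

if ! touch /var/tmp/.disk_probe 2>/dev/null; then
    scontrol update NodeName="$(hostname -s)" State=DRAIN \
        Reason="health check: OS disk not writable"
    exit 1
fi
rm -f /var/tmp/.disk_probe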
Python Diamond has historically been really useful for shipping data to
Graphite. We have a bunch of Diamond collectors we wrote for Slurm as a
result: https://github.com/fasrc/slurm-diamond-collector
However, with Python 2 being end of life and Diamond being unavailable for
Python 3, we need a
I agree that people are making updating slurm.conf a bigger issue than
it really is. However, there are certain config changes that do require
restarting the daemons rather than just doing
'scontrol reconfigure.' These options are documented in the slurm.conf
documentation (jus
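(For the options that do need it, the restart itself is still small; a sketch, assuming systemd units and pdsh with a "compute" host group:)

# Restart the daemons rather than reconfiguring - compute nodes first,
# then the controller.
pdsh -g compute 'systemctl restart slurmd'
systemctl restart slurmctld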
I haven't thought about it too hard, but the default NHC scripts do
not notice that.
That's the problem with NHC and any other problem-checking script: you
have to tell them what errors to check for. As new errors occur, those
scripts inevitably grow longer.
--
Prentice
On 5/4/21 12:47 PM,