No idea if I already replied to this particular thread (if I have,
sorry for the duplicate).
I had issues getting X forwarding to work with SLURM at the start.
It worked via SSH with no problems, but there were authentication / xauth problems when doing it via SLURM.
Turned out to be caused by the nodes having their hostna
Hello,
a lot of people have already given very good answers on how to tackle this.
Still, I thought it worth pointing this out - you said 'you need to
basically shut down slurm, update the slurm.conf file, then restart'.
That makes it sound like a major operation with lots of prep required.
It's no
You can push a new conf file and issue an "scontrol reconfigure" on the fly
as needed... I do it on our cluster whenever required: nodes first, then
login nodes, then the slurm controller... you are making a huge issue of a
very basic task...
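For anyone who hasn't done it before, a minimal sketch of that workflow (the pdcp host groups and file paths below are assumptions; adapt to however you distribute files on your cluster):

#!/bin/bash
# Sketch only: assumes pdcp with "compute" and "login" host groups and that
# the updated slurm.conf already sits on the controller. Use whatever file
# distribution you have (config management, shared filesystem, ...) instead.

# Push the new config out: compute nodes first, then login nodes.
pdcp -g compute /etc/slurm/slurm.conf /etc/slurm/slurm.conf
pdcp -g login   /etc/slurm/slurm.conf /etc/slurm/slurm.conf

# Then have slurmctld (and the slurmd daemons) re-read the configuration.
scontrol reconfigure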
Sid
On Tue, 4 May 2021, 22:28 Tina Friedrich,
wrote:
>
Rename as in my default QoS used to be called 'normal' and is now called
'standard' (same settings). (We refer to it as 'standard' in our SLA; it
should never have been called 'normal', really.)
As I don't think you can actually rename them, what I did was add a copy
of 'normal' QoS (called 'st
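(In case it helps anyone doing the same, a rough sketch of the sacctmgr side of that; names and limits below are examples only, so check your existing settings with 'sacctmgr show qos' first:)

# Inspect the settings on the old QoS first.
sacctmgr show qos normal format=Name,Priority,MaxWall

# Create the new QoS and copy the relevant limits over by hand.
sacctmgr add qos standard
sacctmgr modify qos standard set Priority=0 MaxWall=24:00:00

# Switch users/associations over to the new QoS.
sacctmgr modify user where name=someuser set QOS=standard DefaultQOS=standard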
Not sure if that's changed but aren't there cases where 'scontrol
reconfigure' isn't sufficient? (Like adding nodes?)
But yes, that's my point exactly; it is a pretty basic day-to-day task
to update slurm.conf, not some daunting operation that requires
downtime or anything like it. (I rememb
The task of adding or removing nodes from Slurm is well documented and
discussed in SchedMD presentations; please see my Wiki page:
https://wiki.fysik.dtu.dk/niflheim/SLURM#add-and-remove-nodes
/Ole
On 04-05-2021 14:47, Tina Friedrich wrote:
Not sure if that's changed but aren't there cases wh
Hello,
how do you implement something like "drain host after 10 consecutive
failed jobs"? Unlike a host check script, which checks for known errors,
I'd like to stop jobs being killed just because one node is faulty.
Gerhard
Since you can run an arbitrary script as a node health checker, I might
add a script that counts failures and then closes the node if it hits a
threshold. The script shouldn't need to talk to the slurmctld or
slurmdbd, as it should be able to watch the log on the node and see the failures.
-Paul Edmon-
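A rough sketch of what such a check could look like (the log path, match pattern and threshold are assumptions, and it counts recent rather than strictly consecutive failures):

#!/bin/bash
# Illustrative only: drain this node if too many job failures show up in the
# local slurmd log. Tune LOG, the grep pattern and THRESHOLD for your site.

LOG=/var/log/slurm/slurmd.log
THRESHOLD=10

# Count failure lines among the most recent log entries.
fails=$(tail -n 500 "$LOG" | grep -c -i 'error.*job')

if [ "$fails" -ge "$THRESHOLD" ]; then
    scontrol update NodeName="$(hostname -s)" State=DRAIN \
        Reason="health check: ${fails} recent job failures"
fi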
On
In my most recent experience, I have some SSDs in compute nodes that
occasionally just drop off the bus, so the compute node loses its OS disk.
I haven't thought about it too hard, but the default NHC scripts do not
notice that. Similarly, Paul's proposed script might need to also check
that the s
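(A trivial extra check for that case might look something like the following sketch; the probe file location is just an example:)

#!/bin/bash
# Illustrative sketch: drain the node if the OS filesystem is no longer
# writable, e.g. because the SSD dropped off the bus and the root fs went
# read-only.

if ! touch /var/tmp/.disk_probe 2>/dev/null; then
    scontrol update NodeName="$(hostname -s)" State=DRAIN \
        Reason="health check: OS disk not writable"
    exit 1
fi
rm -f /var/tmp/.disk_probe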
Python Diamond has historically been really useful for shipping data to
Graphite. We have a bunch of Diamond collectors we wrote for Slurm as a
result: https://github.com/fasrc/slurm-diamond-collector
However, with Python 2 being end of life and Diamond being unavailable for
Python 3, we need a
I agree that people are making updating slurm.conf a bigger issue than
it really is. However, there are certain config changes that do require
restarting the daemons rather than just doing
'scontrol reconfigure.' These options are documented in the slurm.conf
documentation (jus
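(For the options that do need it, the restart itself is still small; a sketch, assuming systemd units and pdsh with a "compute" host group:)

# Restart the daemons rather than reconfiguring - compute nodes first,
# then the controller.
pdsh -g compute 'systemctl restart slurmd'
systemctl restart slurmctld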
I haven't thought about it too hard, but the default NHC scripts do
not notice that.
That's the problem with NHC and any other problem-checking script: you
have to tell them what errors to check for. As new errors occur, those
scripts inevitably grow longer.
--
Prentice
On 5/4/21 12:47 PM,