[slurm-users] Re: Restarting slurmd kills still-running jobs

Brian Andrus via slurm-users Thu, 12 Feb 2026 16:01:39 -0800

That smells like the munge key was changed, which would cause the behavior you see.
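One way to test that hypothesis is to compare a checksum of the key file across nodes. The following is only a rough sketch: the helper names are made up for illustration, and it assumes the key lives at the usual default location, /etc/munge/munge.key, which may differ on your systems.

```python
import hashlib
from pathlib import Path


def key_digest(path):
    """Return the SHA-256 hex digest of the key file at `path`."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def nodes_with_different_key(reference_digest, digests_by_node):
    """Return the sorted node names whose digest differs from the reference.

    `digests_by_node` is a hypothetical mapping {node_name: digest} that you
    would collect yourself, e.g. by running key_digest() as root on each node.
    """
    return sorted(
        node
        for node, digest in digests_by_node.items()
        if digest != reference_digest
    )
```

Taking the controller's digest as the reference, any node that shows up in the result has a different key from the controller.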


Brian Andrus


On 2/12/2026 11:56 AM, Griebel, Christian via slurm-users wrote:

Dear community,
Trying to implement the latest fix/patch for munged, we restarted the updated munged locally on the compute nodes with "systemctl restart munged", resulting in the sudden death of a lot of compute nodes' slurmd.
Checking the jobs on the affected nodes, we saw a lot of user processes/jobs still running, which was good - yet "systemctl restart slurmd" cancelled all of them, e.g.
[2026-02-12T17:08:00.325] Cleaning up stray StepId=49695760.extern
[2026-02-12T17:08:00.325] [49695760.extern] Sent signal 9 to StepId=49695760.extern
[2026-02-12T17:08:00.325] slurmd version 25.05.5 started
and all affected user jobs (even though having survived the death of their parent slurmd) were killed and re-queued...
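To see which jobs were hit on a given node, the "Cleaning up stray" lines can be pulled out of the slurmd log. This is a minimal sketch, assuming the log lines look like the excerpt above; the function name and the regular expression are illustrative and not part of Slurm.

```python
import re

# Matches lines such as:
# [2026-02-12T17:08:00.325] Cleaning up stray StepId=49695760.extern
# Assumption: the step is always written as <jobid>.<stepname>.
STRAY_RE = re.compile(
    r"^\[(?P<time>[^\]]+)\]\s+Cleaning up stray StepId=(?P<job>\d+)\.(?P<step>\S+)"
)


def stray_steps(log_lines):
    """Yield (timestamp, job_id, step) for each 'Cleaning up stray' line."""
    for line in log_lines:
        match = STRAY_RE.match(line.strip())
        if match:
            yield match.group("time"), int(match.group("job")), match.group("step")


sample = [
    "[2026-02-12T17:08:00.325] Cleaning up stray StepId=49695760.extern",
    "[2026-02-12T17:08:00.325] slurmd version 25.05.5 started",
]
for timestamp, job_id, step in stray_steps(sample):
    print(timestamp, job_id, step)
```

For a real node, pass an open file handle on the slurmd log (the file set by SlurmdLogFile in slurm.conf) instead of the sample list.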
We have cgroups v2 (only, no hybrid), "Delegate=yes" in the slurmd unit and "ProctrackType=proctrack/cgroup" configured.
Other sites do not see the same behavior (their user jobs survive a slurmd restart without issues), so now we are at a loss figuring out why the h.... this happens within our setup.
Anyone experienced similar problems and got them solved...?


Thanks in advance -

--
___________________________
Christian Griebel/HPC

-- 
slurm-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
