Hi Ron,

I am also in the "paranoid" group, and I've always done updates with jobs
"live". Depending on the size of your userbase, you may want to consider
pausing the submission/start of new jobs while you execute the dnf commands
(yes, I use dnf rather than "raw" rpm, because I think it is less
error-prone with e.g. dependencies).
Since you are in the same group as I am, you can save a list of running
jobs before and after executing the dnf commands, and see if they match. If
they do, congratulations, everything went well. If they don't, there is a
(tiny) risk that the jobs which completed during that time might miss
"something". Examine their logs and/or warn the users as appropriate. To be
clear, this tiny risk concerns jobs that would complete *on their own*
during that timeframe; it is not that the slurm update will cause healthy
jobs to crash. What could happen is a race condition between a job
terminating and the slurm update, which might try to update some
information in some DB in an inconsistent way. My understanding is that the
job itself (e.g. its output files) is safe; it's just the slurm records
which might run into trouble.
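
For what it's worth, the before/after comparison could look roughly like
the sketch below. It is only an illustration, not a tested procedure: the
helper names (parse_job_ids, running_job_ids, finished_during_update) are
made up for this example, and it assumes squeue is on the PATH of the
machine where you run it.

```python
import subprocess


def parse_job_ids(squeue_output: str) -> set[str]:
    """Turn one-job-ID-per-line text into a set, ignoring blank lines."""
    return {line.strip() for line in squeue_output.splitlines() if line.strip()}


def running_job_ids() -> set[str]:
    """Ask squeue for the IDs of all currently running jobs.

    -h drops the header, -t RUNNING filters on job state,
    -o %A prints only the job ID (one per line).
    """
    result = subprocess.run(
        ["squeue", "-h", "-t", "RUNNING", "-o", "%A"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_job_ids(result.stdout)


def finished_during_update(before: set[str], after: set[str]) -> set[str]:
    """Jobs that were running before the update but no longer are after it."""
    return before - after
```

Call running_job_ids() right before the dnf commands, keep the result, call
it again right after, and pass both sets to finished_during_update(). An
empty result means the lists match as far as pre-existing jobs go; anything
in it is a job whose logs are worth a look. Jobs that only appear in the
"after" set simply started in between (which cannot happen if you paused
job starts).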

You mention "waiting at least a week" between successive updates, but
really the key point is this:

*Before considering the upgrade complete, **wait for all jobs that were
already running to finish**.*

Which means: if you have a 6h wallclock limit, you only need to wait about
7h. If you have a 2-month wallclock limit, you need to wait a bit more than
2 months. If you don't have a wallclock limit... you may have to wait
forever... Wait! You have the list of jobs, because you are paranoid like
me and made one as mentioned above, so you "only" have to wait for all of
them to complete before proceeding, not "forever".
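
The "wait only for the jobs in your list" idea can be sketched as a small
polling loop. Again this is just an illustration under assumptions: the
names (still_pending, wait_for_snapshot, fetch_running) are hypothetical,
the 10-minute poll interval is arbitrary, and fetch_running stands for
whatever function you use to get the set of job IDs that have not finished
yet (for example one built on squeue).

```python
import time
from typing import Callable


def still_pending(snapshot: set[str], running_now: set[str]) -> set[str]:
    """Jobs from the pre-upgrade snapshot that have not finished yet."""
    return snapshot & running_now


def wait_for_snapshot(
    snapshot: set[str],
    fetch_running: Callable[[], set[str]],
    poll_seconds: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Block until every job in the snapshot is gone; return the number of waits.

    Jobs that started after the snapshot was taken are ignored, so the loop
    ends even if the cluster never becomes idle.
    """
    polls = 0
    while still_pending(snapshot, fetch_running()):
        polls += 1
        sleep(poll_seconds)
    return polls
```

The point of intersecting with the saved snapshot is exactly the one made
above: new jobs keep arriving, but only the ones that were already running
at upgrade time decide when you can move on to the next step.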

With these precautions, most likely you won't encounter any issue (of
course, that depends on the size of the cluster: if you have a huge one
with hundreds of thousands of users/jobs/nodes, you will see things that
have a 0.001% chance of happening and that most of us never encounter).

HTH.

On Tue, Jan 20, 2026 at 1:31 PM Ron Gould via slurm-users <
[email protected]> wrote:

> Thank you for that guidance. I am certainly in the "overly cautious" and
> "paranoid" groups.
>
> I will probably go through the slower upgrade process (1-8 list), with at
> least a week between them.
>
> And yes, if anyone has experience doing such a vault between versions,
> please chime in.
>
> --
> slurm-users mailing list -- [email protected]
> To unsubscribe send an email to [email protected]
>
