I think those warnings are for the overly cautious. Certainly we have
never waited for all jobs to exit before upgrading. Out of paranoia we
pause all our jobs, but that is not required. Typically you can upgrade
between versions without pausing or canceling jobs. That said, you will
want to look at the release notes and changelog for the version you are
upgrading to in case anything flagged there warrants extra paranoia.
Generally minor version upgrades are fine.
The thing I would note though is this phrase: "Any jobs started before
the slurmd system was upgraded will be running with the old version of
slurmstepd, so starting another upgrade or trying to use new features in
the new version may cause problems." What this is really saying is that
upgrading in quick succession (especially major upgrades) could be
problematic. So say you were to go from 22.05.3 -> 23.11 and then
immediately go to 25.05 while jobs from the old version were still
running; that could cause problems. If you intend to go from your
current version to the latest, I recommend spacing out the upgrades or
taking a full downtime.
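If you do take a full downtime, a system maintenance reservation is one
way to fence it off ahead of time (the name, start time, and duration
below are placeholders):

    # block the whole cluster for an 8 hour maintenance window
    scontrol create reservation reservationname=slurm_upgrade \
        starttime=2026-02-01T08:00:00 duration=480 \
        users=root flags=maint,ignore_jobs nodes=ALL
    # remove it once the upgrade is done
    scontrol delete reservationname=slurm_upgrade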
That said, I have never done an upgrade over that large a version
change, so someone with more experience on the list should be able to
answer any questions related to that. My gut says, though, that if I
were trying to step to the latest version I would either clear out the
existing jobs, or I would do one upgrade per week to give the jobs on
the cluster time to adjust to the new version.
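Either way, before each step you can check whether anything launched
under the previous version is still around; something like this (the
format string is just an example) lists what is running and when it
started:

    # show running jobs with their start times so you can confirm
    # everything submitted before the last upgrade has drained
    squeue --states=RUNNING --format="%.12i %.9P %.10u %.20S"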
-Paul Edmon-
On 1/20/2026 2:42 PM, Gould, Ron (GRC-VBA0)[AEGIS] via slurm-users wrote:
The tl;dr is “This is my first upgrade since inheriting this Cluster,
so I’m not sure what can or can’t be running during the upgrades.”
My Cluster is running an old version, 22.05.3. This is my first
upgrade since inheriting the Cluster. As such, I’d like to install
22.05.4 because it’s a short jump, and it fixes the bug my users are
seeing.
The Cluster is composed mostly of Oracle Linux 8 nodes. I’m aware that I
can upgrade within the two-release compatibility window. I’ve read
through the Upgrade guide and I’m unclear whether downtime is required.
Perhaps I’m conflating downtime requirements across different SLURM
services when I should be reading certain services as having their own
downtime requirements.
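For what it’s worth, checking what each component currently reports is
straightforward (run on the relevant hosts; the grep is just how I check
the config dump):

    # on the controller / login node
    scontrol show config | grep -i SLURM_VERSION
    sinfo --version
    # on a compute node
    slurmd -V
    # on the accounting host
    slurmdbd -V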
https://slurm.schedmd.com/upgrades.html
In the Upgrade Procedure section
<https://slurm.schedmd.com/upgrades.html#procedure>, there are a couple
of points that confuse me.
1. Is downtime required? Does downtime == “all jobs must be halted”?
   “Downtime”, to me, seems like nothing should be running. This
   statement indicates that jobs can be running during the upgrade:

      "Before considering the upgrade complete, *wait for all jobs that
      were already running to finish*. Any jobs started before the
      slurmd system was upgraded will be running with the old version of
      slurmstepd, so starting another upgrade or trying to use new
      features in the new version may cause problems."
   Yet a few paragraphs later
   <https://slurm.schedmd.com/upgrades.html#downtime>, this message
   indicates I will need downtime:

      "Refer to the expected downtime guidance in the following sections
      for each relevant Slurm daemon."
   Further in the guide, in SLURMD (COMPUTE NODES)
   <https://slurm.schedmd.com/upgrades.html#slurmd>, I read:

      "Upgrades will not interrupt running jobs as long as SlurmdTimeout
      is not reached during the process."

   This implies, at least, that existing running jobs can stay running.
2. There are conflicting suggestions about using “rpm” to install the
   RPMs I built with “rpmbuild”. Should I use “dnf localinstall
   ./*.rpm”? I’m inferring that dependencies will (not) be handled
   correctly.
      "NOTE: If RPM/DEB packages are used, all packages present on each
      system must be upgraded together instead of piecewise. …
      *Avoid using low-level package managers like rpm or dpkg* as they
      may not properly enforce these dependencies."

   However, in SLURMDBD (ACCOUNTING)
   <https://slurm.schedmd.com/upgrades.html#slurmdbd>, this statement:

      "Upgrade the slurmdbd daemon binaries, libraries, and its systemd
      unit file (if used). If using RPM/DEB packages, the package
      manager will take care of these."

   indicates I should be using RPM packages.
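My working assumption is therefore to point dnf at the whole set of
locally built packages so it resolves them together in one transaction;
something along these lines, where the path and glob are just whatever
rpmbuild produced on my build host:

    # upgrade every locally built Slurm package at once, letting dnf
    # resolve the dependencies among them
    dnf upgrade ~/rpmbuild/RPMS/x86_64/slurm-*22.05.4*.rpm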
Lastly, to get to a current install, I need to step through multiple
versions, with the condition that jobs started with a specific major
version must finish within the compatibility window. GitLab has a tool
where you plug in your current and intended versions and it tells you
explicitly which versions are required along the upgrade path. I’d
like a similarly explicit tool for SLURM, but I infer from the
Compatibility Window
<https://slurm.schedmd.com/upgrades.html#compatibility_window> that I
can update like so:
1. Current = 22.05.3
2. 23.11
3. 25.05
4. 26.05
That feels like a big leapfrog between versions. I’d like the practice
of upgrading. Is there any detriment to upgrading at a slower pace:
1. Current = 22.05.3
2. 22.05.11
3. 23.02.8
4. 23.11.11
5. 24.05.8
6. 24.11.7
7. 25.05.6
8. 25.11.2
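Either way, my reading of the guide is that each hop would follow
roughly this order (package paths and service unit names assumed to be
the stock ones):

    # 1. accounting daemon first
    systemctl stop slurmdbd
    dnf upgrade ./slurm-*.rpm        # on the slurmdbd host
    systemctl start slurmdbd         # performs any database conversion
    # 2. then the controller
    systemctl stop slurmctld
    dnf upgrade ./slurm-*.rpm        # on the slurmctld host
    systemctl start slurmctld
    # 3. then compute and login nodes, keeping within SlurmdTimeout
    dnf upgrade ./slurm-*.rpm        # on each node
    systemctl restart slurmd

Is that roughly right?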
--
slurm-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]