The tl;dr: this is my first upgrade since inheriting this Cluster, so I’m 
not sure what can or can’t be running during the upgrade.

My Cluster is running an old version, 22.05.3. This is my first upgrade since 
inheriting the Cluster. As such, I’d like to install 22.05.4 because it’s a 
short jump, and it fixes the bug my users are seeing.

The Cluster runs mostly Oracle Linux 8. I’m aware that I can upgrade 
within the two-release compatibility window. I’ve read through the Upgrade 
guide and I’m unclear whether downtime is required. Perhaps I’m lumping the 
downtime requirements of the different Slurm services together when each 
service actually has its own downtime requirements.

https://slurm.schedmd.com/upgrades.html

In the Upgrade Procedure 
section<https://slurm.schedmd.com/upgrades.html#procedure>, there are a couple 
of things I find unclear.


  1. Is downtime required? Does downtime == “all jobs must be halted”? “Downtime”, 
to me, means nothing should be running. Yet this statement indicates that jobs 
can be running during the upgrade:

    Before considering the upgrade complete, wait for all jobs that were 
already running to finish. Any jobs started before the slurmd system was 
upgraded will be running with the old version of slurmstepd, so starting 
another upgrade or trying to use new features in the new version may cause 
problems.

Within a few paragraphs<https://slurm.schedmd.com/upgrades.html#downtime>, however, 
this message indicates I will need downtime:

    Refer to the expected downtime guidance in the following sections for each 
relevant Slurm daemon

Further in the guide, in SLURMD (COMPUTE 
NODES)<https://slurm.schedmd.com/upgrades.html#slurmd>, I read:

    Upgrades will not interrupt running jobs as long as SlurmdTimeout is not 
reached during the process

This implies, at least, that existing running jobs can stay running.

  2. The guide seems to give conflicting advice about using “rpm” to install the 
RPMs I built with “rpmbuild”. Should I use “dnf localinstall ./*.rpm” instead? 
I’m inferring that dnf will handle the dependencies correctly, whereas plain 
“rpm” may not.

    NOTE: If RPM/DEB packages are used, all packages present on each system 
must be upgraded together instead of piecewise. … Avoid using low-level package 
managers like rpm or dpkg as they may not properly enforce these dependencies

However, in SLURMDBD 
(ACCOUNTING)<https://slurm.schedmd.com/upgrades.html#slurmdbd>, this statement

    Upgrade the slurmdbd daemon binaries, libraries, and its systemd unit file 
(if used). If using RPM/DEB packages, the package manager will take care of 
these

indicates I should be using RPM packages.


Lastly, to get to a current install, I need to step through multiple versions, 
with the condition that jobs started with a specific major version must finish 
within the compatibility window. GitLab has a tool where you plug in your 
current and intended versions and it tells you explicitly which versions are 
required along the upgrade path. I’d like a similarly explicit tool for Slurm, 
but I infer from the Compatibility 
Window<https://slurm.schedmd.com/upgrades.html#compatibility_window> that I can 
update like so:

  1. Current = 22.05.3
  2. 23.11
  3. 25.05
  4. 26.05

That feels like a big leapfrog between versions. I’d also like the practice of 
upgrading. Is there any downside to upgrading at a slower pace, like so?

  1. Current = 22.05.3
  2. 22.05.11
  3. 23.02.8
  4. 23.11.11
  5. 24.05.8
  6. 24.11.7
  7. 25.05.6
  8. 25.11.2
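
For what it’s worth, the first, four-step path above can be sketched as a small 
Python helper, in the spirit of the GitLab tool. This is only an illustration 
under assumptions: the RELEASES list is typed in by hand (26.05 is included only 
because it appears in the path above), the names plan and max_jump are made up, 
and the window sizes (two major releases back for targets before 24.11, three 
from 24.11 on) are my reading of the Compatibility Window section and should be 
checked against the guide.

```python
# Major releases in order. Assumption: typed in by hand, not read from Slurm;
# 26.05 is listed only because it appears in the path discussed above.
RELEASES = ["22.05", "23.02", "23.11", "24.05", "24.11", "25.05", "25.11", "26.05"]


def max_jump(target: str) -> int:
    """How many major releases back an upgrade to `target` may start from.

    Assumption: two releases for targets before 24.11, three from 24.11 on.
    """
    return 3 if RELEASES.index(target) >= RELEASES.index("24.11") else 2


def plan(current: str, goal: str) -> list[str]:
    """Greedy path: from each stop, jump to the furthest release in window."""
    i, end = RELEASES.index(current), RELEASES.index(goal)
    if i > end:
        raise ValueError("goal is older than current")
    path = [current]
    while i < end:
        # Furthest index k > i whose window still reaches back to index i.
        # k = i + 1 always qualifies, so max() never sees an empty sequence.
        i = max(k for k in range(i + 1, end + 1) if k - i <= max_jump(RELEASES[k]))
        path.append(RELEASES[i])
    return path


print(plan("22.05", "26.05"))
```

With those assumptions it prints ['22.05', '23.11', '25.05', '26.05'], which 
matches the first list. It only reasons about major releases; picking the 
maintenance release at each stop is left out.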
-- 
slurm-users mailing list -- [email protected]