> On Feb 3, 2022, at 2:55 PM, Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk> 
> wrote:
> 
> On 03-02-2022 16:37, Nathan Smith wrote:
>> Yes, we are running slurmdbd. We could arrange enough downtime to do an 
>> incremental upgrade of major versions as Brian Andrus suggested, at least on 
>> the slurmctld and slurmdbd systems. The slurmds I would just do a direct 
>> upgrade once the scheduler work was completed.
> 
> As Brian Andrus said, you must upgrade Slurm by at most 2 major versions, and 
> that includes slurmd's as well!  Don't do a "direct upgrade" of slurmd by 
> more than 2 versions!
> 
> I recommend separate physical servers for slurmdbd and slurmctld.  Then you 
> can upgrade slurmdbd without taking the cluster offline.  It's OK for 
> slurmdbd to be down for many hours, since slurmctld caches the state 
> information in the meantime.

The one thing you want to watch out for with letting slurmdbd stay down while
slurmctld caches (maybe more so if slurmctld runs in a VM rather than on a
physical server, since you may have sized its RAM for how much slurmctld appears
to need, as we did) is that that caching uses memory, obviously enough when you
think about it. The result is that if slurmdbd is down for a long time while jobs
keep running (we had someone hitting a bug that started launching jobs right
after everyone went to sleep, for example), slurmctld can eventually run out of
memory and crash, and then that cache is lost. You don't normally see memory
being used that way, because slurmdbd is normally up and accepting the accounting
data.
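
If you do leave slurmdbd down for a while, it is worth watching how much
slurmctld is queueing. A couple of things that can help (MaxDBDMsgs exists in
reasonably recent releases; check the slurm.conf man page for your version for
the exact behavior when the cap is hit):

  # sdiag shows how many accounting messages slurmctld is holding for slurmdbd
  sdiag | grep -i 'DBD Agent'

  # rough view of slurmctld's memory footprint while that queue grows
  ps -o rss,vsz,cmd -C slurmctld

  # in slurm.conf (newer releases): cap how many messages slurmctld will queue
  # for slurmdbd, rather than letting it grow until the daemon runs out of memory
  #MaxDBDMsgs=100000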

--
#BlackLivesMatter
____
|| \\UTGERS,     |---------------------------*O*---------------------------
||_// the State  |         Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ  | Office of Advanced Research Computing - MSB C630, Newark
     `'
