Re: [slurm-users] Upgrade from 17.02.11 to 21.08.2 and state information

2022-02-02 Thread Brian Haymore
Are you running slurmdbd in your current setup?  If you are, the upgrade
path may have additional considerations when moving this far across versions.

--
Brian D. Haymore
University of Utah
Center for High Performance Computing
155 South 1452 East RM 405
Salt Lake City, Ut 84112
Phone: 801-558-1150, Fax: 801-585-5366
http://bit.ly/1HO1N2C

From: slurm-users on behalf of Nathan Smith
Sent: Wednesday, February 2, 2022 2:38 PM
To: slurm-us...@schedmd.com 
Subject: [slurm-users] Upgrade from 17.02.11 to 21.08.2 and state information


The "Upgrades" section of the quick-start guide [0] warns:

> Slurm permits upgrades to a new major release from the past two major
> releases, which happen every nine months (e.g. 20.02.x or 20.11.x to
> 21.08.x) without loss of jobs or other state information. State
> information from older versions will not be recognized and will be
> discarded, resulting in loss of all running and pending jobs.

We are planning for an upgrade from 17.02.11 to 21.08.2. As a part of
our upgrade procedure we'd be bringing the scheduler to full stop, so
the loss of running and pending jobs would not be a concern. Is there
anything more to state information than running and pending jobs? For
example, would the JobID count revert to 1 in the case of such an
upgrade?

[0] https://slurm.schedmd.com/quickstart_admin.html#upgrade

--
Nathan Smith
Research Systems Engineer
Advanced Computing Center
Oregon Health & Science University


Re: [slurm-users] Upgrade from 17.02.11 to 21.08.2 and state information

2022-02-02 Thread Brian Andrus

I actually just did that path for a children's hospital.

It was fairly straightforward. Running jobs were not affected.

You do need to go 17 -> 18 -> 19 -> 20 -> 21.

This is because there were changes in the db schema between those releases.

If you plan on bringing everything to a stop (no running jobs), you 
should be good. You will still need to do the incremental upgrades for 
the db changes, but no worries about state files either way.
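A dry-run sketch of that incremental path follows; it only prints the commands. The version list and package names are assumptions for illustration; substitute the releases you actually build and check the release notes at each hop.

```shell
#!/bin/sh
# Dry-run sketch of the incremental slurmdbd upgrade path: it only
# echoes the commands it would run. Versions and package names are
# placeholders, not a tested recipe.
upgrade_steps() {
  for ver in 18.08 19.05 20.02 21.08; do
    echo "systemctl stop slurmdbd"
    echo "install slurm-slurmdbd-$ver"
    # Run slurmdbd in the foreground once so the schema conversion
    # finishes before any systemd start timeout can kill it:
    echo "slurmdbd -D -vvv   # wait for conversion to complete, then Ctrl-C"
    echo "systemctl start slurmdbd   # now on $ver"
  done
}
upgrade_steps
```

Back up the database before each hop; the conversion at each step can take a long time on a large accounting database.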


Brian Andrus

On 2/2/2022 1:38 PM, Nathan Smith wrote:
> We are planning for an upgrade from 17.02.11 to 21.08.2. [...]





Re: [slurm-users] Upgrade from 17.02.11 to 21.08.2 and state information

2022-02-03 Thread Nathan Smith
Yes, we are running slurmdbd. We could arrange enough downtime to do an 
incremental upgrade of major versions as Brian Andrus suggested, at least on 
the slurmctld and slurmdbd systems. The slurmds I would just upgrade 
directly once the scheduler work was completed.

--
Nathan Smith
Research Systems Engineer
Advanced Computing Center
Oregon Health & Science University

From: slurm-users On Behalf Of Brian Haymore
Sent: Wednesday, February 2, 2022 1:51 PM
To: slurm-us...@schedmd.com; Slurm User Community List
Subject: [EXTERNAL] Re: [slurm-users] Upgrade from 17.02.11 to 21.08.2 and state information

> Are you running slurmdbd in your current setup?  If you are then the upgrade
> path there might have additional considerations moving this far in versions.



Re: [slurm-users] Upgrade from 17.02.11 to 21.08.2 and state information

2022-02-03 Thread Ole Holm Nielsen

On 03-02-2022 16:37, Nathan Smith wrote:
> Yes, we are running slurmdbd. We could arrange enough downtime to do an
> incremental upgrade of major versions as Brian Andrus suggested, at
> least on the slurmctld and slurmdbd systems. The slurmds I would just do
> a direct upgrade once the scheduler work was completed.


As Brian Andrus said, you must upgrade Slurm by at most 2 major 
versions at a time, and that includes the slurmds as well!  Don't do a 
"direct upgrade" of slurmd by more than 2 versions!


I recommend separate physical servers for slurmdbd and slurmctld.  Then 
you can upgrade slurmdbd without taking the cluster offline.  It's OK 
for slurmdbd to be down for many hours, since slurmctld caches the state 
information in the meantime.


I've described the Slurm upgrade process in detail in my Wiki page:
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm

Since you start from 17.02, you have to be extremely cautious when 
upgrading the database!  See the Wiki page for details.  Make sure to 
test the database upgrade on a test server, using a database dump 
instead of the real slurmdbd server.
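That rehearsal can be sketched as a dry run that just prints the commands; the database name, paths, and hosts below are placeholders, not the real setup.

```shell
#!/bin/sh
# Dry-run sketch: rehearse the slurmdbd schema conversion on a scratch
# server using a dump of the production accounting database. Every name
# here is a placeholder; the script only echoes the commands.
rehearse_db_upgrade() {
  db=slurm_acct_db
  dump=/tmp/${db}.sql
  echo "mysqldump --single-transaction $db > $dump   # on the production DB host"
  echo "mysql -e 'create database $db'               # on the scratch test host"
  echo "mysql $db < $dump"
  echo "slurmdbd -D -vvv   # run the NEW slurmdbd against the copy; time the conversion"
}
rehearse_db_upgrade
```

Timing the conversion on the copy tells you how long the production outage window needs to be.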


I hope this helps.

/Ole






Re: [slurm-users] Upgrade from 17.02.11 to 21.08.2 and state information

2022-02-03 Thread Ryan Novosielski


> On Feb 3, 2022, at 2:55 PM, Ole Holm Nielsen  wrote:
>
> [...]
>
> I recommend separate physical servers for slurmdbd and slurmctld.  Then you
> can upgrade slurmdbd without taking the cluster offline.  It's OK for
> slurmdbd to be down for many hours, since slurmctld caches the state
> information in the meantime.

The one thing you want to watch out for here (maybe more so if you are using a 
VM rather than a physical server, since you may have sized the RAM for how much 
slurmctld appears to need, as we did) is that the caching that takes place on 
the slurmctld uses memory (obviously, when you think about it). The result 
can be that if you have slurmdbd down for a long time (we had someone who was 
hitting a bug that would start running jobs right after everyone went to sleep, 
for example), your slurmctld eventually runs out of memory, crashes, and then 
that cache is lost. You don't normally see that memory being used like that, 
because slurmdbd is normally up and accepting the accounting data.
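A small sketch for keeping an eye on that growth while slurmdbd is offline; looking up the slurmctld pid is an assumption (the demo below uses the current shell's own pid instead).

```shell
#!/bin/sh
# Print a process's resident set size in kB, so slurmctld's memory use
# can be watched while slurmdbd is down. On a real controller host,
# substitute "$(pidof slurmctld)" for the demo pid below.
rss_kb() {
  ps -o rss= -p "$1" | tr -d ' '
}
rss_kb $$   # demo: RSS of this shell itself
```

Run it from cron or a watch loop and alert before the controller approaches its memory limit.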

--
#BlackLivesMatter

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB C630, Newark
 `'



Re: [slurm-users] Upgrade from 17.02.11 to 21.08.2 and state information

2022-02-04 Thread Bjørn-Helge Mevik
Ole Holm Nielsen  writes:

> As Brian Andrus said, you must upgrade Slurm by at most 2 major
> versions, and that includes slurmd's as well!  Don't do a "direct 
> upgrade" of slurmd by more than 2 versions!

That should only be an issue if you have running jobs during the
upgrade, shouldn't it?  As I understand it, without any running jobs,
you can do pretty much what you want on the compute nodes.  Or am I
missing something here?

-- 
Cheers,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo





Re: [slurm-users] Upgrade from 17.02.11 to 21.08.2 and state information

2022-02-04 Thread Ole Holm Nielsen

On 04-02-2022 08:59, Bjørn-Helge Mevik wrote:
> Ole Holm Nielsen  writes:
>
>> As Brian Andrus said, you must upgrade Slurm by at most 2 major
>> versions, and that includes slurmd's as well!  Don't do a "direct
>> upgrade" of slurmd by more than 2 versions!
>
> That should only be an issue if you have running jobs during the
> upgrade, shouldn't it?  As I understand it, without any running jobs,
> you can do pretty much what you want on the compute nodes.  Or am I
> missing something here?


I think that Slurm's communication protocol is incompatible when 
versions differ by more than 2, so the slurmd daemons may lose contact 
with the slurmctld in that case.


In my experience, it's not a problem to upgrade slurmd while the nodes 
are running jobs: Upgrade the slurmd RPM, and slurmd will restart itself 
and attach to the running jobs.  There are probably cases where this 
will cause job crashes, so please heed the information collected in the 
Wiki page 
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-on-centos-7

There may be some issues with MPI applications as mentioned in the Wiki.
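That per-node procedure can be sketched as a dry run; node names, the package manager, and the ssh transport are assumptions (on RPM-based systems the package scriptlet may restart slurmd for you).

```shell
#!/bin/sh
# Dry-run sketch of a rolling slurmd upgrade while jobs keep running;
# it only prints the per-node commands. Hostnames are placeholders.
rolling_upgrade() {
  for node in "$@"; do
    echo "ssh $node 'yum -y upgrade slurm-slurmd && systemctl restart slurmd'"
  done
}
rolling_upgrade node001 node002 node003
```

Doing a few nodes first and checking that slurmd reattaches to its jobs limits the blast radius if something goes wrong.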

/Ole



Re: [slurm-users] Upgrade from 17.02.11 to 21.08.2 and state information

2022-02-04 Thread Ole Holm Nielsen

On 03-02-2022 21:59, Ryan Novosielski wrote:
> The one thing you want to watch out for here [...] is that the caching that
> takes place on the slurmctld uses memory. [...] your slurmctld can run out of
> memory, crash, and then that cache is lost. You don't normally see that
> memory being used like that, because slurmdbd is normally up/accepting the
> accounting data.


The slurmctld caches job state information in:
# scontrol show config | grep StateSaveLocation
StateSaveLocation   = /var/spool/slurmctld

The StateSaveLocation should retain job information even if slurmctld 
crashes (at least the data which have been committed to disk).


The StateSaveLocation file system must not fill up, of course!  There 
are also some upper limits to the number of records in 
StateSaveLocation, but I can't find the numbers right now.
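A quick sketch of a check for that; the default path and the 90% threshold are assumptions for illustration.

```shell
#!/bin/sh
# Warn when the file system holding StateSaveLocation is nearly full.
# The default path and the 90% threshold are illustrative only.
check_statesave() {
  dir=${1:-/var/spool/slurmctld}
  pct=$(df -P "$dir" 2>/dev/null | awk 'NR==2 {sub(/%/,"",$5); print $5}')
  [ -n "$pct" ] || { echo "cannot stat $dir"; return 1; }
  if [ "$pct" -ge 90 ]; then
    echo "WARNING: file system for $dir is ${pct}% full"
  else
    echo "OK: file system for $dir is ${pct}% full"
  fi
}
check_statesave /   # demo on the root file system
```

Wiring this into monitoring catches the problem before slurmctld fails to write its state files.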


/Ole