Re: [slurm-users] Slurm Upgrade Philosophy?

2020-12-24 Thread Chris Samuel

On 24/12/20 6:24 am, Paul Edmon wrote:

> We then have a test cluster that we install the release on and run a few
> test jobs to make sure things are working, usually MPI jobs as they tend
> to hit most of the features of the scheduler.


One thing I meant to mention last night was that we use ReFrame from
CSCS as the test framework for our systems. Our user support folks
maintain our local tests, as they're best placed to understand the user
requirements that need coverage, and we feed our system-facing
requirements to them so they can add tests for that side too.


https://reframe-hpc.readthedocs.io/

All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA



Re: [slurm-users] Slurm Upgrade Philosophy?

2020-12-24 Thread Paul Edmon
We are the same way, though we tend to keep pace with minor releases.
We typically wait until the .1 release of a new major release before
considering an upgrade, so that many of the bugs are worked out.  We then
have a test cluster that we install the release on and run a few test jobs
to make sure things are working, usually MPI jobs as they tend to hit
most of the features of the scheduler.
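
A minimal sketch of that kind of smoke test, assuming a two-node test
partition. The function name and the SBATCH override variable are made up
for illustration, and "srun hostname" only stands in for a real MPI
binary:

```shell
#!/bin/bash
# Hypothetical smoke test for a freshly upgraded test cluster.
# SBATCH can be overridden (e.g. for a dry run); it defaults to the real sbatch.
SBATCH="${SBATCH:-sbatch}"

# smoke_test [NODES] -> prints PASS or FAIL
smoke_test() {
    local nodes="${1:-2}"
    # --wait makes sbatch block until the job ends and exit with the job's
    # exit code; sbatch's own "Submitted batch job" line is sent to stderr
    # so that stdout carries only the verdict.
    if "$SBATCH" --wait --nodes="$nodes" --ntasks-per-node=1 \
            --wrap 'srun hostname' >&2; then
        echo "PASS"
    else
        echo "FAIL"
        return 1
    fi
}
```

Calling "smoke_test 2" on the test cluster prints PASS only if the job was
accepted, scheduled and ran to completion on both nodes.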


We also like to stay current with releases as there are new features we
want, or features we didn't know we wanted but our users find and start
using.  So our general methodology is to upgrade to the latest minor
release at our next monthly maintenance.  For major releases we will
upgrade at our next monthly maintenance after the .1 release is out,
unless we run into a show-stopping bug in our own testing, at which
point we file a bug with SchedMD and get a patch.
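
Roughly, that decision rule could be written down as below. This is only a
sketch: the function name and its three outputs are invented for
illustration, versions are assumed to look like YY.MM.N, and it does not
check that the candidate is actually newer than what is installed.

```shell
#!/bin/bash
# should_upgrade CURRENT CANDIDATE   (e.g. should_upgrade 20.02.6 20.11.1)
# Prints:
#   minor  same major release, take it at the next maintenance
#   major  new major release that has reached .1 or later
#   wait   new major release still at .0
should_upgrade() {
    local cur="$1" cand="$2"
    local cur_major="${cur%.*}"     # 20.02.6 -> 20.02
    local cand_major="${cand%.*}"   # 20.11.1 -> 20.11
    local cand_minor="${cand##*.}"  # 20.11.1 -> 1
    if [ "$cur_major" = "$cand_major" ]; then
        echo "minor"
    elif [ "$cand_minor" -ge 1 ]; then
        echo "major"
    else
        echo "wait"
    fi
}
```

The "unless there is a show-stopping bug" part stays a human decision and
is deliberately not encoded here.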


-Paul Edmon-




Re: [slurm-users] Slurm Upgrade Philosophy?

2020-12-23 Thread Chris Samuel
On Friday, 18 December 2020 10:10:19 AM PST Jason Simms wrote:

> Thanks to several helpful members on this list, I think I have a much better
> handle on how to upgrade Slurm. Now my question is, do most of you upgrade
> with each major release?

We do, though not immediately and not without a degree of testing on our test 
systems.  One of the big reasons for us upgrading is that we've usually paid 
for features in Slurm for our needs (for example in 20.11 that includes 
scrontab so users won't be tied to favourite login nodes, as well as the
experimental RPC queue code due to the large numbers of RPCs our systems need 
to cope with).

I also keep an eye out for discussions of what other sites find with new
releases, so I'm following the current concerns about 20.11 and the change
in behaviour for job steps that do (expanding NVIDIA's example slightly):

#SBATCH --exclusive
#SBATCH -N2
srun --ntasks-per-node=1 python multi_node_launch.py

which (if I'm reading the bugs correctly) fails in 20.11 as that srun no
longer gets all the allocated resources and instead just gets the default of
--cpus-per-task=1.  That also affects things like mpirun in OpenMPI
built with Slurm support (as it effectively calls "srun orted" and that "orted"
launches the MPI ranks, so in 20.11 it only has access to a single core for
them all to fight over).  Again - if I'm interpreting the bugs correctly!
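
If that reading is right, one possible way around it is to ask for the
CPUs explicitly rather than relying on the step inheriting the whole
allocation. The sketch below is untested against 20.11 and assumes
homogeneous nodes; step_opts is a made-up helper name, while
SLURM_CPUS_ON_NODE and SLURM_JOB_ID are the environment variables Slurm
sets inside a job:

```shell
#!/bin/bash
#SBATCH --exclusive
#SBATCH -N2

# Print srun options for a one-task-per-node step that requests every CPU
# on the node explicitly. Outside a job SLURM_CPUS_ON_NODE is unset, so
# the value falls back to 1.
step_opts() {
    echo "--ntasks-per-node=1 --cpus-per-task=${SLURM_CPUS_ON_NODE:-1}"
}

# Only launch the step when running inside a Slurm job.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    # Word splitting of the option string is intended here.
    srun $(step_opts) python multi_node_launch.py
fi
```

Whether this is sufficient for the mpirun/orted case is exactly what would
need checking on a test system.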

I don't currently have a test system that's free to try 20.11 on, but 
hopefully early in the new year I'll be able to test this out to see how much 
of an impact this is going to have and how we will manage it.

https://bugs.schedmd.com/show_bug.cgi?id=10383
https://bugs.schedmd.com/show_bug.cgi?id=10489

All the best,
Chris
-- 
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA






Re: [slurm-users] Slurm Upgrade Philosophy?

2020-12-18 Thread Alex Chekholko
Hi Jason,

Ultimately each site decides how/why to do it; in my case I tend to do big
"forklift upgrades", so I'm running 18.08 on the current cluster and will
go to the latest Slurm for my next cluster build.  But you may have good
reasons to upgrade Slurm more often on your existing cluster.  I don't use
any of the advanced features.

Regards,
Alex


On Fri, Dec 18, 2020 at 10:13 AM Jason Simms wrote:

> Hello all,
>
> Thanks to several helpful members on this list, I think I have a much
> better handle on how to upgrade Slurm. Now my question is, do most of you
> upgrade with each major release?
>
> I recognize that, normally, if something is working well, then don't
> upgrade it! In our case, we're running 20.02, and it seems to be working
> well for us. The notes for 20.11 don't indicate any "must have" features
> for our use cases, but I'm still new to Slurm, so maybe there is a hidden
> benefit I can't immediately see.
>
> Given that, I would normally not consider upgrading. But as I understand
> it, you cannot upgrade more than two major releases back, so if I skip this
> one, I'd have to upgrade to (presumably) 21.08, or else I'd have to "double
> upgrade" if, e.g., I wanted to go from 20.02 to 22.05.
>
> To prevent that, do most people try to stay within the most recent two
> versions? Or do you go as long as you possibly can with your existing
> version, upgrading only if you absolutely must?
>
> Warmest regards,
> Jason
>
> --
> *Jason L. Simms, Ph.D., M.P.H.*
> Manager of Research and High-Performance Computing
> XSEDE Campus Champion
> Lafayette College
> Information Technology Services
> 710 Sullivan Rd | Easton, PA 18042
> Office: 112 Skillman Library
> p: (610) 330-5632
>


[slurm-users] Slurm Upgrade Philosophy?

2020-12-18 Thread Jason Simms
Hello all,

Thanks to several helpful members on this list, I think I have a much
better handle on how to upgrade Slurm. Now my question is, do most of you
upgrade with each major release?

I recognize that, normally, if something is working well, then don't
upgrade it! In our case, we're running 20.02, and it seems to be working
well for us. The notes for 20.11 don't indicate any "must have" features
for our use cases, but I'm still new to Slurm, so maybe there is a hidden
benefit I can't immediately see.

Given that, I would normally not consider upgrading. But as I understand
it, you cannot upgrade directly from more than two major releases back, so
if I skip this one, I'd have to upgrade to (presumably) 21.08, or else I'd
have to "double upgrade" if, e.g., I wanted to go from 20.02 to 22.05.
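
To make that arithmetic concrete, here is a small sketch of the rule as I
understand it. The release list is an assumption (21.08 and 22.05 are only
the presumed future releases mentioned above), the function names are made
up, and downgrades are not handled:

```shell
#!/bin/bash
# Ordered list of major releases, oldest first. Assumed for illustration;
# adjust it to the releases that actually exist.
RELEASES="20.02 20.11 21.08 22.05"

# release_index MAJOR -> 0-based position in RELEASES, or return 1 if unknown
release_index() {
    local i=0 r
    for r in $RELEASES; do
        if [ "$r" = "$1" ]; then
            echo "$i"
            return 0
        fi
        i=$((i + 1))
    done
    return 1
}

# upgrade_path FROM TO -> "direct" if TO is at most two major releases
# after FROM, otherwise "stepwise" (an intermediate upgrade is needed)
upgrade_path() {
    local from to
    from="$(release_index "$1")" || return 2
    to="$(release_index "$2")" || return 2
    if [ $((to - from)) -le 2 ]; then
        echo "direct"
    else
        echo "stepwise"
    fi
}
```

With that list, "upgrade_path 20.02 21.08" prints direct and
"upgrade_path 20.02 22.05" prints stepwise, which is the "double upgrade"
case.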

To prevent that, do most people try to stay within the most recent two
versions? Or do you go as long as you possibly can with your existing
version, upgrading only if you absolutely must?

Warmest regards,
Jason

-- 
*Jason L. Simms, Ph.D., M.P.H.*
Manager of Research and High-Performance Computing
XSEDE Campus Champion
Lafayette College
Information Technology Services
710 Sullivan Rd | Easton, PA 18042
Office: 112 Skillman Library
p: (610) 330-5632