[slurm-users] Re: Avoiding fragmentation

2024-04-10 Thread Williams, Jenny Avis via slurm-users
Here are various options that might help reduce job fragmentation.

Turn up debugging on slurmctld and add DebugFlags such as TraceJobs,
SelectType, and Steps. With debugging set high enough, you can see a good deal
of the node-selection logic.
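
A minimal sketch (flag names as in stock Slurm; adjust levels to taste),
either persistently in slurm.conf:

   SlurmctldDebug=debug2
   DebugFlags=TraceJobs,SelectType,Steps

or on a running controller, without a restart:

   scontrol setdebug debug2
   scontrol setdebugflags +TraceJobs

The extra logging is verbose, so turn it back down once you have seen enough
of the selection decisions.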

  CR_LLN Schedule resources to jobs on the least loaded nodes (based
         upon the number of idle CPUs). This is generally only
         recommended for an environment with serial jobs as idle
         resources will tend to be highly fragmented, resulting in
         parallel jobs being distributed across many nodes. Note that
         node Weight takes precedence over how many idle resources are
         on each node. Also see the partition configuration parameter
         LLN to use the least loaded nodes in selected partitions.
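
As a sketch, the partition-level LLN variant mentioned there might look like
this (partition and node names hypothetical):

   PartitionName=serial Nodes=node[001-016] LLN=YES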

Explore node weights. If your nodes are not identical, apply node weights to
sort them in the order in which you wish them to be selected. Even for
homogeneous nodes, you might try sets of weights so that, within a given
scheduling cycle, the scheduler considers a smaller group of nodes at one
weight before moving on to the nodes at the next weight. Each weight set
might contain no fewer than 1/3 or 1/4 of the total partition size. YMMV
based on, for instance, the ratio of serial jobs to MPI jobs, job length,
etc. I have seen evidence that node allocation progresses roughly this way.
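
A slurm.conf sketch of such weight tiers, with hypothetical node names and
sizes, splitting a homogeneous 96-node partition into three sets so that the
scheduler fills the lowest-weight set first:

   NodeName=node[001-032] Weight=10 CPUs=128 RealMemory=512000
   NodeName=node[033-064] Weight=20 CPUs=128 RealMemory=512000
   NodeName=node[065-096] Weight=30 CPUs=128 RealMemory=512000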

Turn on backfill, and educate users to better fit both their job resource
requirements and the job runtime; this will allow backfill to work more
efficiently. Note that backfill choices are made within a given set of jobs
within a partition.
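
A sketch of the scheduler side, using parameters that exist in stock Slurm
(the bf_* values are site-dependent tuning knobs, not recommendations):

   SchedulerType=sched/backfill
   SchedulerParameters=bf_window=2880,bf_continue

Backfill can only slot a job into a gap if the job's time limit fits that
gap, which is why accurate runtimes from users matter.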


  CR_Pack_Nodes
         If a job allocation contains more resources than will be used
         for launching tasks (e.g. if whole nodes are allocated to a
         job), then rather than distributing a job's tasks evenly
         across its allocated nodes, pack them as tightly as possible
         on these nodes. For example, consider a job allocation
         containing two entire nodes with eight CPUs each. If the job
         starts ten tasks across those two nodes without this option,
         it will start five tasks on each of the two nodes. With this
         option, eight tasks will be started on the first node and two
         tasks on the second node. This can be superseded by "NoPack"
         in srun's "--distribution" option. CR_Pack_Nodes only applies
         when the "block" task distribution method is used.

  pack_serial_at_end
         If used with the select/cons_res or select/cons_tres plugin,
         then put serial jobs at the end of the available nodes rather
         than using a best fit algorithm. This may reduce resource
         fragmentation for some workloads.

  reduce_completing_frag
         This option is used to control how scheduling of resources is
         performed when jobs are in the COMPLETING state, which
         influences potential fragmentation. If this option is not
         set, then no jobs will be started in any partition when any
         job is in the COMPLETING state for less than CompleteWait
         seconds. If this option is set, then no jobs will be started
         in any individual partition that has a job in COMPLETING
         state for less than CompleteWait seconds. In addition, no
         jobs will be started in any partition with nodes that overlap
         with any nodes in the partition of the completing job. This
         option is to be used in conjunction with CompleteWait.
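
Putting the options above together, one possible slurm.conf sketch
(illustrative values, not a recommendation):

   SelectType=select/cons_tres
   SelectTypeParameters=CR_Core_Memory,CR_Pack_Nodes
   SchedulerParameters=bf_window=2880,bf_continue,pack_serial_at_end,reduce_completing_frag
   CompleteWait=32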

-Original Message-
From: Gerhard Strangar via slurm-users  
Sent: Tuesday, April 9, 2024 12:53 AM
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] Avoiding fragmentation

Hi,

I'm trying to figure out how to deal with a mix of few- and many-cpu jobs. By 
that I mean most jobs use 128 cpus, but sometimes there are jobs with only 16. 
As soon as that job with only 16 is running, the scheduler splits the next 128 
cpu jobs into 96+16 each, instead of assigning a full 128 cpu node to them. Is 
there a way for the administrator to achieve preferring full nodes?
The existence of pack_serial_at_end makes me believe there is not, because
that basically is what I needed, apart from my serial jobs using 16 cpus
instead of 1.

Gerhard


[slurm-users] Re: Avoiding fragmentation

2024-04-09 Thread Juergen Salk via slurm-users
Hi Gerhard,

I am not sure if this counts as an administrative measure, but we
strongly encourage our users to always specify --nodes=n
together with --ntasks-per-node=m (rather than just --ntasks=n*m and
omitting the --nodes option, which may lead to cores being allocated
here, there and everywhere, as long as the network topology allows it).
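
That is, for example, a 128-task MPI job on (hypothetical) 64-core nodes
would be submitted as

   #SBATCH --nodes=2
   #SBATCH --ntasks-per-node=64

rather than only

   #SBATCH --ntasks=128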

I do understand Loris' and Tim's arguments, but for certain reasons we
have configured a single-user node access policy (ExclusiveUser=YES),
which allows multiple jobs to share a node, but only jobs owned by one
and the same user. So we also try to avoid fragmentation whenever
possible and want users to pack their jobs as densely as possible onto
the nodes, in order to leave as many nodes as possible available for
others. For us, this works reasonably well in terms of core utilization
because we have almost no users who submit only one or two few-core
jobs at a time; they usually submit whole batches of such jobs
(sometimes hundreds) at once, of which several then run simultaneously
on each node. That keeps the waste of unallocated cores on individual
nodes within acceptable limits for us.
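
In slurm.conf terms that policy is a partition flag, roughly as follows
(partition and node names hypothetical):

   PartitionName=standard Nodes=node[001-096] ExclusiveUser=YES Default=YES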

Best regards
Jürgen


* Loris Bennett via slurm-users  [240409 07:51]:
> [...]

-- 
Jürgen Salk
Scientific Software & Compute Services (SSCS)
Kommunikations- und Informationszentrum (kiz)
Universität Ulm
Telefon: +49 (0)731 50-22478
Telefax: +49 (0)731 50-22471





[slurm-users] Re: Avoiding fragmentation

2024-04-09 Thread Paul Edmon via slurm-users
I wrote a little blog post on this topic a few years back:
https://www.rc.fas.harvard.edu/blog/cluster-fragmentation/

It's a vexing problem, but as noted by the other responders it is
something that depends on your cluster policy and job performance needs.
Well-written MPI code should be able to scale well even when given
non-optimal topologies.

You might also look at Node Weights
(https://slurm.schedmd.com/slurm.conf.html#OPT_Weight). We use them on
mosaic partitions so that the latest hardware is left available for
larger jobs needing more performance. You can also use them to force
jobs to one side of the partition, though generally the scheduler does
this automatically.

-Paul Edmon-


On 4/9/24 6:45 AM, Cutts, Tim via slurm-users wrote:
> [...]





[slurm-users] Re: Avoiding fragmentation

2024-04-09 Thread Cutts, Tim via slurm-users
Agree with that. Plus, of course, even if the jobs run a bit slower by not
having all the cores on a single node, they will be scheduled sooner, so the
overall turnaround time for the user will be better, and ultimately that's
what they care about. I've always been of the view, for any scheduler, that
the less you try to constrain it the better. It really depends on what you're
trying to optimise for, but generally speaking I try to optimise for maximum
utilisation and throughput, unless I have a specific business case that needs
to prioritise particular workloads, and then I'll compromise on throughput to
get the urgent workload through sooner.

Tim

From: Loris Bennett via slurm-users 
Sent: 09 April 2024 06:51
To: slurm-users@lists.schedmd.com 
Cc: Gerhard Strangar 
Subject: [slurm-users] Re: Avoiding fragmentation

[...]





[slurm-users] Re: Avoiding fragmentation

2024-04-08 Thread Loris Bennett via slurm-users
Hi Gerhard,

Gerhard Strangar via slurm-users  writes:

> [...]

This may well not be relevant for your case, but we actively discourage
the use of full nodes for the following reasons:

  - When the cluster is full, which is most of the time, MPI jobs in
general will start much faster if they don't specify the number of
nodes and certainly don't request full nodes.  The overhead due to
the jobs being scattered across nodes is often much lower than the
additional waiting time incurred by requesting whole nodes.

  - When all the cores of a node are requested, all the memory of the
node becomes unavailable to other jobs, regardless of how much
memory is requested or indeed how much is actually used.  This holds
up jobs with low CPU but high memory requirements and thus reduces
the total throughput of the system.

These factors are important for us because we have a large number of
single core jobs and almost all the users, whether doing MPI or not,
significantly overestimate the memory requirements of their jobs.

Cheers,

Loris

-- 
Dr. Loris Bennett (Herr/Mr)
FUB-IT (ex-ZEDAT), Freie Universität Berlin

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com