Re: [slurm-users] Determining Cluster Usage Rate

2021-05-17 Thread Diego Zuccato

On 17/05/21 09:25, Ole Holm Nielsen wrote:

I hope that someone on the list can help you build Debian packages.  
The problem is not just rebuilding Slurm: if I rebuild Slurm, I have to 
rebuild OpenMPI, OpenIB and a lot of other stuff that I don't know in 
enough detail.


When you find the time, you must upgrade by at most 2 Slurm versions at 
a time, so you have to upgrade in two steps, for example 
18.08->19.05->20.11.
I usually just stop everything for the upgrade, then upgrade to whatever 
Debian is shipping at the moment. If the history is lost, it's not a big 
issue (that's what DB backups are for :) ).
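For reference, a careful stop-everything upgrade (following the two-steps 
rule Ole mentions) could look roughly like the sketch below; the package 
names, service names and the slurm_acct_db database name are just the 
usual Debian/Slurm defaults, not necessarily what a given site uses:

# Sketch only: quiesce the cluster and back up the accounting DB first
systemctl stop slurmctld slurmd slurmdbd
mysqldump --single-transaction slurm_acct_db > slurm_acct_db-$(date +%F).sql

# Step 1: install an intermediate release no more than two major versions
# ahead (e.g. 19.05 when coming from 18.08, from backports or locally built
# packages), then start slurmdbd alone so it can convert the database
apt install slurm-wlm slurmdbd
systemctl start slurmdbd

# Step 2: repeat with the target release (e.g. 20.11), then bring the
# daemons back
apt install slurm-wlm slurmdbd
systemctl start slurmdbd slurmctld slurmd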


My Slurm upgrade instructions refer to CentOS, but the overall process 
would be the same for all Linuxes:

https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm
Please read carefully the SchedMD documentation linked from this page.

Tks.

I upgrade Slurm frequently and have no problems doing so.  We're at 
20.11.7 now.  You should avoid 20.11.{0-2} due to a bug in MPI.

That's really useful info.

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786



Re: [slurm-users] Determining Cluster Usage Rate

2021-05-17 Thread Ole Holm Nielsen

On 5/17/21 8:59 AM, Diego Zuccato wrote:

On 15/05/21 00:43, Christopher Samuel wrote:


It just doesn't recognize 'ALL'. It works if I specify the resources.

That's odd, what does this say?
sreport --version

slurm-wlm 18.08.5-2
That's the package from Debian stable (we don't have the manpower to 
handle manually-compiled packages).
As Ole said, it's an old version. I'd love to be able to keep up with the 
newest releases, but ... :(


I hope that someone on the list can help you build Debian packages.  When 
you find the time, you must upgrade by at most 2 Slurm versions at a time, 
so you have to upgrade in two steps, for example 18.08->19.05->20.11.


My Slurm upgrade instructions refer to CentOS, but the overall process 
would be the same for all Linuxes:

https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm
Please read carefully the SchedMD documentation linked from this page.


I upgrade Slurm frequently and have no problems doing so.  We're at 
20.11.7 now.  You should avoid 20.11.{0-2} due to a bug in MPI.


/Ole



Re: [slurm-users] Determining Cluster Usage Rate

2021-05-17 Thread Diego Zuccato

On 15/05/21 00:43, Christopher Samuel wrote:


It just doesn't recognize 'ALL'. It works if I specify the resources.

That's odd, what does this say?
sreport --version

slurm-wlm 18.08.5-2
That's the package from Debian stable (we don't have the manpower to 
handle manually-compiled packages).
As Ole said, it's an old version. I'd love to be able to keep up with 
the newest releases, but ... :(


--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786



Re: [slurm-users] Determining Cluster Usage Rate

2021-05-16 Thread Juergen Salk
* Juergen Salk  [210515 23:54]:
> * Christopher Samuel  [210514 15:47]:
> 
> > > Usage reported in Percentage of Total
> > >
> > >   Cluster  TRES Name  Allocated      Down  PLND Dow      Idle  Reserved  Reported
> > > --------- ---------- ---------- --------- --------- --------- --------- ---------
> > >       oph        cpu     81.93%     0.00%     0.00%    15.85%     2.22%   100.00%
> > >       oph        mem     80.60%     0.00%     0.00%    19.40%     0.00%   100.00%
> > 
> > The "Reserved" column is the one you're interested in, it's indicating that
> > for the 13th some jobs were waiting for CPUs, not memory.
> 
> 
> However, there is also "Overcommited" in the sreport man page which
> looks promising by description - although its exact definition 
> is also not completely clear to me right away:
> 
> --- snip ---
> 
> Overcommited  
> 
>Time of eligible jobs waiting in the queue over the Reserved time.
>Unlike Reserved, this has no limit. It is typically useful to
>determine whether your system is overloaded and by how much.
> 
> --- snip ---

And I just noticed that this description of "Overcommited" in the sreport(1) 
man page first appeared in versions 20.02.7 and 20.11.1, respectively.

In versions prior to 20.02.7 and 20.11.1 it still read:

--- snip ---

Overcommited

   Time that the nodes were over allocated, either with the -O,
   --overcommit flag at submission time or OverSubscribe set to FORCE
   in the slurm.conf. This time is not counted against the total
   reported time.

--- snip ---

So I assume the description of "Overcommited" in the sreport(1) man page was 
simply wrong in older versions (unless its semantics changed with
versions 20.02.7 and 20.11.1) ...

Best regards
Jürgen





Re: [slurm-users] Determining Cluster Usage Rate

2021-05-15 Thread Juergen Salk
* Christopher Samuel  [210514 15:47]:

> > Usage reported in Percentage of Total
> >
> >   Cluster  TRES Name  Allocated      Down  PLND Dow      Idle  Reserved  Reported
> > --------- ---------- ---------- --------- --------- --------- --------- ---------
> >       oph        cpu     81.93%     0.00%     0.00%    15.85%     2.22%   100.00%
> >       oph        mem     80.60%     0.00%     0.00%    19.40%     0.00%   100.00%
> 
> The "Reserved" column is the one you're interested in, it's indicating that
> for the 13th some jobs were waiting for CPUs, not memory.

Hi Chris,

the wording in the documentation is somewhat nebulous, but my
understanding is that the "Reserved" column in sreport indicates the
amount of resources that were actually idle but reserved by Slurm for
scheduling purposes and, thus, unavailable for immediate job
allocations. I assume this includes, for example, the time the
scheduler needs to free sufficient resources for the highest priority
job that is waiting for the number of requested nodes to become
available. I think there might be more reasons for Slurm to mark
resources reserved (but not including resource reservations created
with scontrol as these are reported as "Allocated" resources by
sreport unless created with MAINT or IGNORE_JOBS flags).

Anyway, as far as I understand the documentation, the sreport
"Reserved" column by itself does not necessarily indicate the degree
of (over-)utilization of the cluster as it does not take into account
the number of jobs in the queue for which Slurm has not yet started 
blocking idle resources. So, confusingly, there is a difference between
"Reserved" in sreport and sacct. In sreport "Reserved" refers to idle
but reserved cluster resources whereas in sacct "Reserved" means the
waiting time of jobs. Or do I understand this wrong? 
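
To make the difference concrete, this is how I would query the two
"Reserved" fields (the date range and the format lists below are only
illustrative):

   # cluster-level "Reserved": idle but reserved resources, per TRES
   sreport -t hours cluster utilization start=2021-05-01 end=2021-05-14 \
           Format=Cluster,TRESName,Allocated,Idle,Reserved,Reported

   # per-job "Reserved": how long each job waited after becoming eligible
   sacct -X --starttime=2021-05-01 --endtime=2021-05-14 \
         --format=JobID,Partition,Reserved,Elapsed,State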

However, there is also "Overcommited" in the sreport man page which
looks promising by description - although its exact definition 
is also not completely clear to me right away:

--- snip ---

Overcommited  

   Time of eligible jobs waiting in the queue over the Reserved time.
   Unlike Reserved, this has no limit. It is typically useful to
   determine whether your system is overloaded and by how much.

--- snip ---

This field is not included by default in the report but can be added with 
the Format option, e.g. 

sreport  -t percent -T ALL cluster utilization 
Format=TRESName,Allocated,PlannedDown,Down,Idle,Reserved,Overcommitted,Reported

(Note: There seems to be a typo in the sreport man page: it should read
"Overcommitted" rather than "Overcommited".)

Best regards
Jürgen





Re: [slurm-users] Determining Cluster Usage Rate

2021-05-14 Thread Christopher Samuel

On 5/14/21 1:45 am, Diego Zuccato wrote:


Usage reported in Percentage of Total

  Cluster  TRES Name  Allocated      Down  PLND Dow      Idle  Reserved  Reported
--------- ---------- ---------- --------- --------- --------- --------- ---------
      oph        cpu     81.93%     0.00%     0.00%    15.85%     2.22%   100.00%
      oph        mem     80.60%     0.00%     0.00%    19.40%     0.00%   100.00%


The "Reserved" column is the one you're interested in, it's indicating 
that for the 13th some jobs were waiting for CPUs, not memory.


You can look at a longer reporting period by specifying a start date,
something like:

sreport -t percent -T cpu,mem cluster utilization start=2021-01-01

All the best,
Chris
--
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA



Re: [slurm-users] Determining Cluster Usage Rate

2021-05-14 Thread Christopher Samuel

On 5/14/21 1:45 am, Diego Zuccato wrote:


It just doesn't recognize 'ALL'. It works if I specify the resources.


That's odd, what does this say?

sreport --version

All the best,
Chris
--
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA



Re: [slurm-users] Determining Cluster Usage Rate

2021-05-14 Thread Paul Edmon
XDMoD can give these sorts of stats.  I also have some Diamond 
collectors we use in concert with Grafana to pull data and plot it, which 
is useful for seeing large-scale usage trends:


https://github.com/fasrc/slurm-diamond-collector
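
If you only need a quick time series without the full collector setup, a
rough sketch along these lines (the Graphite-style metric names and running
it from cron are my own assumptions for illustration, not how the collector
above works) already gives Grafana something to plot:

#!/bin/sh
# Emit yesterday's per-TRES utilization as Graphite-style plaintext metrics
DAY=$(date -d yesterday +%F)
NOW=$(date +%s)
sreport -n -p -t percent -T cpu,mem cluster utilization \
        start="$DAY" end="$(date +%F)" \
        Format=Cluster,TRESName,Allocated,Idle,Reserved |
while IFS='|' read -r cluster tres alloc idle reserved _; do
    echo "hpc.$cluster.$tres.allocated ${alloc%\%} $NOW"
    echo "hpc.$cluster.$tres.idle ${idle%\%} $NOW"
    echo "hpc.$cluster.$tres.reserved ${reserved%\%} $NOW"
done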

-Paul Edmon-

On 5/13/2021 6:08 PM, Sid Young wrote:


Hi All,

Is there a way to define an effective "usage rate" of an HPC cluster 
using the data captured in the Slurm database?


Primarily I want to see if it can be helpful in presenting to the 
business a case for buying more hardware for the HPC :)


Sid Young


Re: [slurm-users] Determining Cluster Usage Rate

2021-05-14 Thread Diego Zuccato

On 14/05/21 10:24, Ole Holm Nielsen wrote:

Referring to https://slurm.schedmd.com/tres.html, which TRES are defined 
on your cluster?

It just doesn't recognize 'ALL'. It works if I specify the resources.

root@str957-cluster:/var/log# sacctmgr show tres
    Type            Name     ID
-------- --------------- ------
     cpu                      1
     mem                      2
  energy                      3
    node                      4
 billing                      5
      fs            disk      6
    vmem                      7
   pages                      8
root@str957-cluster:/var/log# sreport -t percent -T ALL cluster utilization
sreport: fatal: No valid TRES given
root@str957-cluster:/var/log# sreport -t percent -T cpu,mem cluster utilization


Cluster Utilization 2021-05-13T00:00:00 - 2021-05-13T23:59:59
Usage reported in Percentage of Total

  Cluster  TRES Name  Allocated      Down  PLND Dow      Idle  Reserved  Reported
--------- ---------- ---------- --------- --------- --------- --------- ---------
      oph        cpu     81.93%     0.00%     0.00%    15.85%     2.22%   100.00%
      oph        mem     80.60%     0.00%     0.00%    19.40%     0.00%   100.00%


BYtE,
 Diego

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786



Re: [slurm-users] Determining Cluster Usage Rate

2021-05-14 Thread Ole Holm Nielsen

On 14-05-2021 08:52, Diego Zuccato wrote:

On 14/05/2021 08:19, Christopher Samuel wrote:


sreport -t percent -T ALL cluster utilization

"sreport: fatal: No valid TRES given" :(


This works correctly on our cluster:

$  sreport -t percent -T ALL cluster utilization

Cluster Utilization 2021-05-13T00:00:00 - 2021-05-13T23:59:59
Usage reported in Percentage of Total

  Cluster  TRES Name  Allocated      Down  PLND Dow      Idle  Reserved  Reported
--------- ---------- ---------- --------- --------- --------- --------- ---------
 niflheim        cpu     98.22%     0.11%     0.00%     0.00%     1.67%   100.00%
 niflheim        mem     86.52%     0.10%     0.00%    13.38%     0.00%   100.00%
 niflheim     energy      0.00%     0.00%     0.00%     0.00%     0.00%     0.00%
 niflheim    billing     92.70%     0.04%     0.00%     7.26%     0.00%   100.00%
 niflheim    fs/disk      0.00%     0.00%     0.00%     0.00%     0.00%     0.00%
 niflheim       vmem      0.00%     0.00%     0.00%     0.00%     0.00%     0.00%
 niflheim      pages      0.00%     0.00%     0.00%     0.00%     0.00%     0.00%



Referring to https://slurm.schedmd.com/tres.html, which TRES are defined 
on your cluster?


$ sacctmgr show tres

I get this output:

    Type            Name     ID
-------- --------------- ------
     cpu                      1
     mem                      2
  energy                      3
    node                      4
 billing                      5
      fs            disk      6
    vmem                      7
   pages                      8

/Ole



Re: [slurm-users] Determining Cluster Usage Rate

2021-05-14 Thread Diego Zuccato

On 14/05/2021 08:19, Christopher Samuel wrote:


sreport -t percent -T ALL cluster utilization

"sreport: fatal: No valid TRES given" :(

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786



Re: [slurm-users] Determining Cluster Usage Rate

2021-05-14 Thread Christopher Samuel

On 5/13/21 3:08 pm, Sid Young wrote:


Hi All,


Hiya,

Is there a way to define an effective "usage rate" of an HPC cluster 
using the data captured in the Slurm database?


Primarily I want to see if it can be helpful in presenting to the 
business a case for buying more hardware for the HPC  :)


I have a memory that it's possible to use "sreport" to show you how 
much time jobs were waiting for which TRES - in other words, whether 
they were waiting for CPUs, memory, GPUs, etc. (or some combination).


Ah here you go..

sreport -t percent -T ALL cluster utilization

That breaks things down by all the trackable resources on your system.

Hope that helps!
Chris
--
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA



Re: [slurm-users] Determining Cluster Usage Rate

2021-05-13 Thread Sid Young
Yes, on reflection I should have said utilization rather than usage! I've
been researching which combination of metrics would give me an overall
utilization figure for the HPC.
Sadly it's not as clear-cut as I would have hoped.

Does anyone have any ideas?



Sid Young


On Fri, May 14, 2021 at 1:19 PM Doug Meyer  wrote:

> Probably need to define the problem a bit better.  sreport has very good
> functionality, see the bottom of the man page for examples.  You can group
> orgs into accounting groups to map similar usage, and use wckeys to provide
> accounting for specific users' billing groups.  Configure TRES billing to
> get a chargeback/shareback view.  I recently found this very good site on
> chargeback, with tools to help: Usage Charging Policy - ULHPC Technical
> Documentation (uni.lu) 
>
> Hope it helps.
>
> Doug
>
> On Thu, May 13, 2021 at 3:10 PM Sid Young  wrote:
>
>>
>> Hi All,
>>
>> Is there a way to define an effective "usage rate" of an HPC cluster using
>> the data captured in the Slurm database?
>>
>> Primarily I want to see if it can be helpful in presenting to the
>> business a case for buying more hardware for the HPC  :)
>>
>> Sid Young
>>
>


Re: [slurm-users] Determining Cluster Usage Rate

2021-05-13 Thread Doug Meyer
Probably need to define the problem a bit better.  sreport has very good
functionality, see the bottom of the man page for examples.  You can group
orgs into accounting groups to map similar usage, and use wckeys to provide
accounting for specific users' billing groups.  Configure TRES billing to
get a chargeback/shareback view.  I recently found this very good site on
chargeback, with tools to help: Usage Charging Policy - ULHPC Technical
Documentation (uni.lu) 
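
A minimal slurm.conf sketch of what I mean by TRES billing (the weights,
node names and GPU TRES are placeholders, tune them to your hardware and
charging policy):

   # hypothetical fragment for illustration only
   AccountingStorageTRES=gres/gpu
   PriorityFlags=MAX_TRES
   PartitionName=compute Nodes=node[001-064] TRESBillingWeights="CPU=1.0,Mem=0.25G,GRES/gpu=8.0"

Once that is in place you can report on the synthetic "billing" TRES, e.g.

   sreport -t hours -T billing cluster AccountUtilizationByUser start=2021-01-01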

Hope it helps.

Doug

On Thu, May 13, 2021 at 3:10 PM Sid Young  wrote:

>
> Hi All,
>
> Is there a way to define an effective "usage rate" of an HPC cluster using
> the data captured in the Slurm database?
>
> Primarily I want to see if it can be helpful in presenting to the
> business a case for buying more hardware for the HPC  :)
>
> Sid Young
>