Re: [slurm-users] Tracking efficiency of all jobs on the cluster (dashboard etc.)

John Snowdon Fri, 08 Sep 2023 01:46:29 -0700

I've been needing to do this as part of some analysis work we are undertaking 
to determine requirements for a replacement system.


We don't have anything structured in place currently to analyse Slurm data; 
lots of Grafana system-level metrics but nothing to look at trends of useful 
metrics like:

- Size (and age) of jobs sitting in the pending state
- Average runtime of jobs
- Plotting workload sizing information such as cores/job and memory/core so 
that we can understand how our users are utilising the service
- Demand (and utilisation) of particular partitions

I couldn't find anything that was exactly what we wanted, so I spent a couple 
of afternoons last week putting something together in Python to wrap around 
sacct / sinfo output.
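
For anyone wanting to try the same approach, wrapping sacct could look
something like the sketch below. This is not the actual code; the field list
and the function names parse_sacct and fetch_jobs are made up for illustration,
and it assumes sacct is on the PATH and that you have permission to query all
users' jobs.

```python
import subprocess

# Illustrative field list (an assumption); adjust to the metrics you need.
FIELDS = ["JobID", "Partition", "State", "Submit", "Start", "End",
          "Elapsed", "AllocCPUS", "NNodes", "ReqMem"]


def parse_sacct(text, fields=FIELDS):
    """Turn pipe-delimited, header-less sacct output into a list of dicts."""
    jobs = []
    for line in text.splitlines():
        if not line.strip():
            continue
        values = line.split("|")
        if len(values) != len(fields):
            continue  # skip malformed lines rather than guess at their layout
        jobs.append(dict(zip(fields, values)))
    return jobs


def fetch_jobs(start, end):
    """Run sacct for all users between two Slurm time strings."""
    cmd = ["sacct", "--allusers", "--allocations", "--parsable2", "--noheader",
           "--starttime", start, "--endtime", end,
           "--format", ",".join(FIELDS)]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_sacct(result.stdout)
```

Keeping the parsing separate from the subprocess call means the parser can be
tested against saved sacct output on a machine without Slurm installed.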

So far I've got reports for what is happening 'now', as well as summaries for 
the following periods:

24 hours
7 days
30 days
1 year

Data is analysed based on jobs running/pending/completed/failed during windows 
in time and summarised in terms of sample periods per day (a 24 hour report 
having the finest sampling resolution of 6x 10 minute windows per hour). The 
output of each sample period is stored as a persistent JSON object on the 
filesystem in case the same report is run again, or that period is included as 
part of a larger analysis window.
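
As a rough illustration of the windowing and caching idea (again a sketch, not
the real implementation), the following splits a 24 hour period into 10 minute
windows and stores each window's summary as a JSON file. The names
sample_windows and cached_summary, the cache file naming scheme and the
summarise callback are all assumptions.

```python
import json
from datetime import datetime, timedelta
from pathlib import Path


def sample_windows(end, hours=24, minutes=10):
    """Return (start, end) pairs covering the `hours` leading up to `end`."""
    step = timedelta(minutes=minutes)
    start = end - timedelta(hours=hours)
    windows = []
    while start < end:
        windows.append((start, min(start + step, end)))
        start += step
    return windows


def cached_summary(cache_dir, window, summarise):
    """Load a window's summary from disk, computing it only if it is absent."""
    w_start, w_end = window
    name = f"{w_start:%Y%m%dT%H%M}_{w_end:%Y%m%dT%H%M}.json"
    path = Path(cache_dir) / name
    if path.exists():
        return json.loads(path.read_text())
    summary = summarise(w_start, w_end)  # must return JSON-serialisable data
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(summary))
    return summary
```

With the defaults, sample_windows returns 144 windows for a 24 hour report. A
window that is still in progress probably should not be cached, since its
contents can change; the sketch does not handle that case.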

I output to flat HTML files using the Jinja2 templating module and visualise 
data using the ubiquitous Highcharts and DataTables JavaScript libraries.

In our case we're more interested in things like:

Min/Max/Median cores/job, plus lowest average value which would satisfy X% of 
all jobs
Min/Max/Median memory/core, plus lowest average value which would satisfy X% of 
all jobs
Min/Max/Median nodes/job, plus lowest average value which would satisfy X% of 
all jobs
Backlog of jobs waiting in pending state
Percentage of jobs that 'fail' (end up in some state other than completed)
Scatter chart of cores/job to memory/core (i.e. what is the bulk of our user 
workload; parallel/serial, low memory/high memory?)

i.e. data points which will be useful in our sizing decisions of a replacement 
platform, both in terms of hardware, as well as partition definitions.
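
One way to compute the "lowest value which would satisfy X% of all jobs"
figure is a nearest-rank percentile. The sketch below reads the metric that
way, which is an interpretation rather than a description of the actual code;
the function names, and the choice to exclude PENDING and RUNNING jobs from the
failure rate, are likewise assumptions.

```python
import math
from statistics import median


def smallest_value_covering(values, percent):
    """Nearest-rank percentile: smallest observed value that is greater than
    or equal to at least `percent`% of the values."""
    if not values:
        raise ValueError("no values to summarise")
    ordered = sorted(values)
    rank = max(1, math.ceil(percent * len(ordered) / 100))
    return ordered[rank - 1]


def sizing_summary(values, percent=95):
    """Min/max/median plus the value that would satisfy `percent`% of jobs."""
    return {
        "min": min(values),
        "max": max(values),
        "median": median(values),
        f"p{percent}": smallest_value_covering(values, percent),
    }


def failure_percentage(states):
    """Share of finished jobs that ended in any state other than COMPLETED."""
    finished = [s for s in states if s not in ("PENDING", "RUNNING")]
    if not finished:
        return 0.0
    failed = sum(1 for s in finished if s != "COMPLETED")
    return 100.0 * failed / len(finished)
```

For example, with cores/job values of [1, 1, 1, 2, 4, 4, 8, 16, 32, 128], a
node size of 32 cores would satisfy 90% of jobs even though the maximum is 128.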

When it's at a point where it is useable, I'm sure that we can share the code. 
It's pretty much self-contained; the only dependencies being Slurm and Python 3 
installed - no web components needed (unless you want to serve the generated 
reports to users, of course).

John Snowdon
Advanced Computing Consultant

Newcastle University IT Service
The Elizabeth Barraclough Building
91 Sandyford Road
Newcastle upon Tyne, 
NE1 8HW
