Re: [slurm-users] slurm reporting

2019-11-29 Thread Mark Hahn

> Thanks for your insight. We also work with elasticsearch and I appreciate the
> easy analysis (once one understands Kibana logic). Do you use the job completion
> plugin as is? Or did you modify it to account for SSL or additional metrics?




From a central location, we poll data from each cluster - including sacct,
but also KPI-like measures (node status, partitions, accounts).  These are
just streams of JSON that flow through Logstash.

Mainly this is because we need a detailed, global view across all clusters,
but also partly for historical reasons (pre-existing systems expect job data
in a different, ad-hoc format), and partly to keep systems loosely coupled.
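
To give a flavour of the idea (this is not our actual poller - the field list,
time window and output handling are only illustrative), the per-cluster side
can be as small as:

#!/usr/bin/env python3
# Illustrative sketch only: poll sacct and emit one JSON document per job,
# suitable for shipping to Logstash (file, tcp, or json_lines input).
# The field list and time window are placeholders, not our real config.
import json
import subprocess

FIELDS = ["JobID", "User", "Account", "Partition", "Submit", "Start",
          "End", "Elapsed", "NNodes", "NCPUS", "State"]

def poll_sacct(since="today"):
    out = subprocess.run(
        ["sacct", "--allusers", "--allocations", "--noheader", "--parsable2",
         "--starttime", since, "--format", ",".join(FIELDS)],
        capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        yield dict(zip(FIELDS, line.split("|")))

if __name__ == "__main__":
    for record in poll_sacct():
        print(json.dumps(record))  # one JSON object per line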


regards, mark hahn
--
operator may differ from spokesperson.  h...@mcmaster.ca



Re: [slurm-users] slurm reporting

2019-11-27 Thread Henkel, Andreas
Hi Mark,

Thanks for your insight. We also work with elasticsearch and I appreciate the
easy analysis (once one understands Kibana logic). Do you use the job completion
plugin as is? Or did you modify it to account for SSL or additional metrics?

Best
Andreas 

On 26.11.2019, at 18:27, Mark Hahn wrote:

>> Would Grafana do a similar job to XDMoD?
> 
> I was wondering whether to pipe up.  I work for ComputeCanada, which runs a
> number of significant clusters.  During a major upgrade a few years ago,
> we looked at XDMoD, and decided against it.  Primarily because we wanted 
> greater flexibility - we have specific tracking requirements related to the
> national allocation process, and also wanted better support for many sites.
> 
> What we have now is an ElasticSearch-based system, which is accessible via 
> Grafana and other mechanisms.  It integrates multiple sources of data, such
> as job completion records (scraped very much like XDMoD does it), as well as 
> syslog and other monitoring/collection mechanisms.  It also feeds data into 
> some pre-existing database/reporting mechanisms.
> 
> It's certainly not perfect, but I mention it here because there does seem to
> be a series of queries about managing cluster metadata beyond single Slurm
> instances.  For instance, an external repository of job records means you can
> more freely upgrade a cluster's Slurm, since you know all the job data is in 
> an external, scalable system, and you don't have to baby slurmdbd as much.
> 
> So I think what I'm saying is that I'd encourage people to think about using 
> some of the powerful, open-source infrastructure that exists for parts of
> this task.  Kibana or Grafana make it incredibly easy to do basic analysis
> like averages per user.  And having the data in an open infrastructure also
> means that if you want, you can write a 10-line Python script to generate a
> report (maybe joining data in a way Grafana doesn't let you), or create
> automated actions (email notices, etc.), or even make automated modifications
> to Slurm controls.
> 
> regards,
> --
> Mark Hahn | SHARCnet Sysadmin | h...@sharcnet.ca | http://www.sharcnet.ca
>  | McMaster RHPCS| h...@mcmaster.ca | 905 525 9140 x24687
>  | Compute/Calcul Canada| http://www.computecanada.ca
> 



Re: [slurm-users] slurm reporting

2019-11-26 Thread Mark Hahn

> Would Grafana do a similar job to XDMoD?


I was wondering whether to pipe up.  I work for ComputeCanada, which runs a
number of significant clusters.  During a major upgrade a few years ago,
we looked at XDMoD, and decided against it.  Primarily because we wanted
greater flexibility - we have specific tracking requirements related to the
national allocation process, and also wanted better support for many sites.

What we have now is an ElasticSearch-based system, which is accessible via 
Grafana and other mechanisms.  It integrates multiple sources of data, such
as job completion records (scraped very much like XDMoD does it), as well as 
syslog and other monitoring/collection mechanisms.  It also feeds data into 
some pre-existing database/reporting mechanisms.


It's certainly not perfect, but I mention it here because there does seem to
be a series of queries about managing cluster metadata beyond single Slurm
instances.  For instance, an external repository of job records means you can
more freely upgrade a cluster's Slurm, since you know all the job data is 
in an external, scalable system, and you don't have to baby slurmdbd as much.


So I think what I'm saying is that I'd encourage people to think about using
some of the powerful, open-source infrastructure that exists for parts of
this task.  Kibana or Grafana make it incredibly easy to do basic analysis
like averages per user.  And having the data in an open infrastructure also
means that if you want, you can write a 10-line Python script to generate a
report (maybe joining data in a way Grafana doesn't let you), or create
automated actions (email notices, etc.), or even make automated modifications
to Slurm controls.
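
To show the flavour of that (purely illustrative - the index and field names
are made up and depend entirely on how you ingest the data):

# Illustrative only: average queue wait per user from job records already in
# Elasticsearch.  The index ("slurm-jobs") and fields ("user", "wait_seconds",
# "submit_time") are made-up names; adjust for your ingest pipeline and
# mappings (e.g. you may need "user.keyword" for the terms aggregation).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
resp = es.search(index="slurm-jobs", body={
    "size": 0,
    "query": {"range": {"submit_time": {"gte": "now-7d"}}},
    "aggs": {
        "by_user": {
            "terms": {"field": "user", "size": 50},
            "aggs": {"avg_wait": {"avg": {"field": "wait_seconds"}}},
        }
    },
})
for bucket in resp["aggregations"]["by_user"]["buckets"]:
    print(f'{bucket["key"]:<12} {bucket["avg_wait"]["value"] or 0:.0f} s')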

regards,
--
Mark Hahn | SHARCnet Sysadmin | h...@sharcnet.ca | http://www.sharcnet.ca
  | McMaster RHPCS| h...@mcmaster.ca | 905 525 9140 x24687
  | Compute/Calcul Canada| http://www.computecanada.ca



Re: [slurm-users] slurm reporting

2019-11-26 Thread Renfro, Michael
Once you’d added enough to ingest the Slurm logs into Influx or whatever, it
could be similar. XDMoD already has the pieces in place to dig through your
hierarchy of PIs, users, etc., plus some built-in queries for correlating job
size to wait time, for example:

[screenshot: XDMoD chart correlating job size with wait time]
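
To make the "ingest into Influx" part concrete, here is roughly what that
mapping looks like - a sketch only, with made-up measurement, tag and field
names:

# Illustrative only: what "ingesting job records into Influx" amounts to -
# mapping each record onto an InfluxDB line-protocol point.  The measurement,
# tag and field names are made up; real records would come from sacct output
# or a jobcomp plugin rather than this hard-coded stand-in.
from datetime import datetime

records = [
    {"user": "alice", "partition": "batch", "ncpus": 4,
     "elapsed_sec": 3600, "end": "2019-11-26T10:02:00"},
]

for r in records:
    ts_ns = int(datetime.strptime(r["end"], "%Y-%m-%dT%H:%M:%S").timestamp() * 1e9)
    print(f'slurm_jobs,user={r["user"]},partition={r["partition"]} '
          f'elapsed_sec={r["elapsed_sec"]}i,ncpus={r["ncpus"]}i {ts_ns}')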

I’ve also started using XDMoD as my data source for some short one-slide
presentations, where I extract a graph of the historical resource usage and
overlay our total job count and total CPU-hours used.

On Nov 26, 2019, at 10:21 AM, Ricardo Gregorio <ricardo.grego...@rothamsted.ac.uk> wrote:

Mike,

It sounds interesting... In fact, I had come across XDMoD this morning while
"searching" for further info...

Would Grafana do a similar job to XDMoD?



-----Original Message-----
From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Renfro, Michael
Sent: 26 November 2019 16:14
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] slurm reporting

• Total number of jobs submitted by user (daily/weekly/monthly)
• Average queue time per user (daily/weekly/monthly)
• Average job run time per user (daily/weekly/monthly)

Open XDMoD for these three. https://github.com/ubccr/xdmod , plus
https://xdmod.ccr.buffalo.edu (unfortunately their SSL certificate expired
yesterday, so you’ll get a warning).

• %time partitions were in-use and idle

Not sure how you’d want to define this, plus our partitions have substantial 
overlap on resources (our partitions are primarily to separate GPU or large 
memory jobs from others, and to balance priorities and limits on different 
classes of jobs).

• min/max/avg number of nodes/cpus/mem used per user/job

Open XDMoD for CPUs and nodes, and probably Open XDMoD plus SUPREMM for memory 
(haven’t used this one myself, but I plan to).

--
Mike Renfro, PhD / HPC Systems Administrator, Information Technology Services
931 372-3601 / Tennessee Tech University

On Nov 26, 2019, at 10:02 AM, Ricardo Gregorio <ricardo.grego...@rothamsted.ac.uk> wrote:

Hi all,

I am new to both HPC and SLURM.

I have been trying to run some usage reports (using sreport and sacct), but I
cannot find a way to get the following info:

• Total number of jobs submitted by user (daily/weekly/monthly)
• Average queue time per user (daily/weekly/monthly)
• Average job run time per user (daily/weekly/monthly)
• %time partitions were in-use and idle
• min/max/avg number of nodes/cpus/mem used per user/job

Is this doable?

Regards,
Ricardo Gregorio
Research and Systems Administrator





Re: [slurm-users] slurm reporting

2019-11-26 Thread Ricardo Gregorio
Mike,

It sounds interesting... In fact, I had come across XDMoD this morning while
"searching" for further info...

Would Grafana do a similar job to XDMoD?



-----Original Message-----
From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Renfro, Michael
Sent: 26 November 2019 16:14
To: Slurm User Community List 
Subject: Re: [slurm-users] slurm reporting

> • Total number of jobs submitted by user (daily/weekly/monthly)
> • Average queue time per user (daily/weekly/monthly)
> • Average job run time per user (daily/weekly/monthly)

Open XDMoD for these three. https://github.com/ubccr/xdmod , plus
https://xdmod.ccr.buffalo.edu (unfortunately their SSL certificate expired
yesterday, so you’ll get a warning).

> • %time partitions were in-use and idle

Not sure how you’d want to define this, plus our partitions have substantial 
overlap on resources (our partitions are primarily to separate GPU or large 
memory jobs from others, and to balance priorities and limits on different 
classes of jobs).

> • min/max/avg number of nodes/cpus/mem used per user/job

Open XDMoD for CPUs and nodes, and probably Open XDMoD plus SUPREMM for memory 
(haven’t used this one myself, but I plan to).

--
Mike Renfro, PhD / HPC Systems Administrator, Information Technology Services
931 372-3601 / Tennessee Tech University

> On Nov 26, 2019, at 10:02 AM, Ricardo Gregorio <ricardo.grego...@rothamsted.ac.uk> wrote:
>
> Hi all,
>
> I am new to both HPC and SLURM.
>
> I have been trying to run some usage reports (using sreport and sacct), but I
> cannot find a way to get the following info:
>
> • Total number of jobs submitted by user (daily/weekly/monthly)
> • Average queue time per user (daily/weekly/monthly)
> • Average job run time per user (daily/weekly/monthly)
> • %time partitions were in-use and idle
> • min/max/avg number of nodes/cpus/mem used per user/job
>
> Is this doable?
>
> Regards,
> Ricardo Gregorio
> Research and Systems Administrator
>




Re: [slurm-users] slurm reporting

2019-11-26 Thread Renfro, Michael
>   • Total number of jobs submitted by user (daily/weekly/monthly)
>   • Average queue time per user (daily/weekly/monthly)
>   • Average job run time per user (daily/weekly/monthly)

Open XDMoD for these three. https://github.com/ubccr/xdmod , plus 
https://xdmod.ccr.buffalo.edu (unfortunately their SSL certificate expired 
yesterday, so you’ll get a warning).

>   • %time partitions were in-use and idle

Not sure how you’d want to define this, plus our partitions have substantial 
overlap on resources (our partitions are primarily to separate GPU or large 
memory jobs from others, and to balance priorities and limits on different 
classes of jobs).

>   • min/max/avg number of nodes/cpus/mem used per user/job

Open XDMoD for CPUs and nodes, and probably Open XDMoD plus SUPREMM for memory 
(haven’t used this one myself, but I plan to).
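
(If you want something quick from sacct itself while you evaluate XDMoD, the
first three bullets can also be scripted directly - a minimal sketch, with an
illustrative time window:)

# Minimal sketch: per-user job count, average queue wait and average run time
# from sacct, for an illustrative one-month window of completed jobs.  No
# handling of job arrays, cancelled jobs or clock oddities.
import subprocess
from collections import defaultdict
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%S"
out = subprocess.run(
    ["sacct", "--allusers", "--allocations", "--noheader", "--parsable2",
     "--starttime", "2019-11-01", "--endtime", "2019-11-30",
     "--state", "COMPLETED", "--format", "User,Submit,Start,Elapsed"],
    capture_output=True, text=True, check=True).stdout

stats = defaultdict(lambda: {"n": 0, "wait": 0.0, "run": 0.0})
for line in out.splitlines():
    user, submit, start, elapsed = line.split("|")
    if not user:
        continue
    wait = (datetime.strptime(start, FMT) - datetime.strptime(submit, FMT)).total_seconds()
    days, _, hms = elapsed.rpartition("-")      # Elapsed is [DD-]HH:MM:SS
    h, m, s = (int(x) for x in hms.split(":"))
    run = (int(days) if days else 0) * 86400 + h * 3600 + m * 60 + s
    stats[user]["n"] += 1
    stats[user]["wait"] += wait
    stats[user]["run"] += run

for user, st in sorted(stats.items()):
    print(f'{user:<12} jobs={st["n"]:>5}  avg wait={st["wait"]/st["n"]/60:6.1f} min  '
          f'avg run={st["run"]/st["n"]/60:6.1f} min')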

-- 
Mike Renfro, PhD / HPC Systems Administrator, Information Technology Services
931 372-3601 / Tennessee Tech University

> On Nov 26, 2019, at 10:02 AM, Ricardo Gregorio <ricardo.grego...@rothamsted.ac.uk> wrote:
> 
> Hi all,
>  
> I am new to both HPC and SLURM.
>  
> I have been trying to run some usage reports (using sreport and sacct), but I
> cannot find a way to get the following info:
>  
>   • Total number of jobs submitted by user (daily/weekly/monthly)
>   • Average queue time per user (daily/weekly/monthly)
>   • Average job run time per user (daily/weekly/monthly)
>   • %time partitions were in-use and idle
>   • min/max/avg number of nodes/cpus/mem used per user/job
>  
> Is this doable?
>  
> Regards,
> Ricardo Gregorio
> Research and Systems Administrator
>  



[slurm-users] slurm reporting

2019-11-26 Thread Ricardo Gregorio
Hi all,

I am new to both HPC and SLURM.

I have been trying to run some usage reports (using sreport and sacct), but I
cannot find a way to get the following info:


  *   Total number of jobs submitted by user (daily/weekly/monthly)
  *   Average queue time per user (daily/weekly/monthly)
  *   Average job run time per user (daily/weekly/monthly)
  *   %time partitions were in-use and idle
  *   min/max/avg number of nodes/cpus/mem used per user/job

Is this doable?

Regards,
Ricardo Gregorio
Research and Systems Administrator


Rothamsted Research is a company limited by guarantee, registered in England at 
Harpenden, Hertfordshire, AL5 2JQ under the registration number 2393175 and a 
not for profit charity number 802038.