Also check out xdmod:
http://xdmod.sourceforge.net/
On Thu, Feb 25, 2016 at 09:51:51AM +0100, RDlab wrote:
Hello,
I would suggest that you take a look at S-GAE. It gathers data from qacct and
displays information using eye-candy graphics for each user, each queue and the whole cluster.
It shows process memory usage, averages, queue wait time, etc.
By the way, it is free software under GNU license and we are really happy with
it :)
http://rdlab.cs.upc.edu/s-gae
Best regards,
Gabriel
--
RDlab (Campus Nord - UPC) -- http://rdlab.cs.upc.edu
C/ Jordi Girona 1-3. Edifici Omega, Despatx 005
08034 Barcelona
Telf: +34 93 413 78 20
On 24 Feb 2016, at 21:22, [email protected] wrote:
Is anyone monitoring cluster utilization with a higher-level view than
simply job (qacct) statistics and CPU-seconds used/available?
I'm running SoGE 8.1.6 on a cluster with ~70 nodes, ~1400 cores and
200~350K jobs/month and I'm seeking ways to understand the utilization &
resource constraints in our cluster overall.
The 'jobstats' script is fine for giving feedback to users, looking at
things like avg/high/low job runtime, wait time, etc., but it doesn't
give good information about overall cluster utilization.
I'd like to see these kinds of metrics on cluster use:
histogram of CPU utilization, e.g.:
    Utilization    Time
    100%           5%
    90%            20%
histogram of overall memory use, e.g.:
    Utilization    Time
    100%           0%
    90%            60%
correlation between jobs waiting (CPUs idle) and available memory, as
in:
    Jan 1 14:00 - 20:00
        avg 4GB free/node
        avg 50% CPU-slots used
        avg 12GB RAM request for jobs in 'qw'
        memory is the constraint; cluster is fully
        utilized but CPUs are idle
    Jan 8 08:00 - 14:00
        avg 32GB free/node
        avg 98% CPU-slots used
        avg 2GB RAM request for jobs in 'qw'
        CPU is the constraint; cluster is fully
        utilized but memory is unused
number of jobs queued/waiting (excluding 'hold' jobs)
number of CPUs requested vs [CPU time/wallclock time]
(useful for detecting if users are requesting multiple
cores in the 'threaded' PE but running single-threaded
jobs)
amount of memory used per job as a function of request, e.g.:
    requested    used avg
    =========    ========
    4GB          2.1GB
    12GB         9GB
    20GB         17GB
average duration job spends in 'qw' state
duration of queue time as a function of number of CPUs requested, e.g.:
    1CPU    1hr avg in 'qw'
    2CPU    2hr avg in 'qw'
    4CPU    12hr avg in 'qw'
duration of queue time as a function of amount of RAM requested, e.g.:
    4GB     1hr avg in 'qw'
    12GB    2hr avg in 'qw'
    20GB    12hr avg in 'qw'
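A minimal sketch of how two of the per-job metrics above (CPU time vs. requested CPUs, and average 'qw' time per CPU count) could be computed once accounting records are loaded. The dictionary keys (`slots`, `submission_time`, `start_time`) and the sample values are assumptions for illustration, loosely modelled on accounting fields. Loading the records is left out.

```python
from collections import defaultdict
from statistics import mean


def cpu_efficiency(cpu_seconds, wallclock_seconds, slots):
    """Return CPU time used as a fraction of (wallclock * requested slots).

    A value near 1/slots suggests a single-threaded job in a multi-slot request.
    Returns None when the job has no measurable capacity.
    """
    capacity = wallclock_seconds * slots
    if capacity <= 0:
        return None
    return cpu_seconds / capacity


def mean_wait_by_slots(jobs):
    """Average time (seconds) between submission and start, grouped by slots."""
    waits = defaultdict(list)
    for job in jobs:
        waits[job["slots"]].append(job["start_time"] - job["submission_time"])
    return {slots: mean(values) for slots, values in sorted(waits.items())}


# Hypothetical records; times are epoch seconds.
jobs = [
    {"slots": 1, "submission_time": 0, "start_time": 3000},
    {"slots": 1, "submission_time": 0, "start_time": 4200},
    {"slots": 4, "submission_time": 0, "start_time": 43200},
]

wait_by_slots = mean_wait_by_slots(jobs)
four_slot_efficiency = cpu_efficiency(3600, 3600, 4)
```

The same grouping would work for requested RAM by swapping the key used in `mean_wait_by_slots`.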
I think that the only way to get this information would be to run 'qstat'
periodically, then capture & process that data. Any better suggestions or
scripts that anyone can share?
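As a rough sketch of the "capture & process" step, the function below turns periodic utilization samples into the kind of histogram described earlier. It assumes the samples are percentages of CPU slots (or memory) in use, taken at equal intervals. How they are collected is not shown, and the function name and bucket width are illustrative choices.

```python
from collections import Counter


def utilization_histogram(samples, bucket_width=10):
    """Map utilization bucket -> percentage of samples falling in that bucket.

    samples: utilization percentages (0-100) taken at equal time intervals,
    so the share of samples approximates the share of time.
    A bucket labelled 90 covers 90 up to (but not including) 100.
    """
    if not samples:
        return {}
    counts = Counter()
    for pct in samples:
        clamped = min(max(pct, 0), 100)
        bucket = int(clamped // bucket_width) * bucket_width
        counts[bucket] += 1
    total = len(samples)
    return {b: 100.0 * n / total for b, n in sorted(counts.items(), reverse=True)}


# Hypothetical samples, e.g. one per polling interval.
histogram = utilization_histogram([100, 95, 92, 90, 45])
for bucket, share in histogram.items():
    print(f"{bucket:>3}%  {share:.0f}%")
```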
Thanks,
Mark
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
--
Jesse Becker (Contractor)