Is anyone monitoring cluster utilization with a higher-level view than
simple per-job (qacct) statistics and CPU-seconds used/available?
I'm running SoGE 8.1.6 on a cluster with ~70 nodes, ~1400 cores and
200K~350K jobs/month, and I'm looking for ways to understand utilization
and resource constraints across the cluster as a whole.
The 'jobstats' script is fine for giving feedback to users, looking
at things like avg/high/low job runtime, wait time, etc., but it doesn't
give good information about overall cluster utilization.
I'd like to see these kinds of metrics on cluster use:
histogram of CPU utilization, e.g.:
    Utilization    Time
    100%            5%
     90%           20%
histogram of overall memory use, e.g.:
    Utilization    Time
    100%            0%
     90%           60%
correlation between jobs waiting (CPUs idle) and available memory, as
in:
    Jan 1 14:00 - 20:00
        avg 4GB free/node
        avg 50% CPU-slots used
        avg 12GB RAM request for jobs in 'qw'
        memory is the constraint: the cluster is fully
        utilized but CPUs are idle
    Jan 8 08:00 - 14:00
        avg 32GB free/node
        avg 98% CPU-slots used
        avg 2GB RAM request for jobs in 'qw'
        CPU is the constraint: the cluster is fully
        utilized but memory is unused
number of jobs queued/waiting (excluding 'hold' jobs)
number of CPUs requested vs [CPU time/wallclock time]
(useful for detecting if users are requesting multiple
cores in the 'threaded' PE but running single-threaded
jobs)
amount of memory used per job as a function of request, e.g.:
    requested    used (avg)
    =========    ==========
    4GB          2.1GB
    12GB         9GB
    20GB         17GB
average duration a job spends in the 'qw' state
duration of queue time as a function of number of CPUs requested, e.g.:
    1 CPU     1hr avg in 'qw'
    2 CPU     2hr avg in 'qw'
    4 CPU    12hr avg in 'qw'
duration of queue time as a function of amount of RAM requested, e.g.:
    4GB       1hr avg in 'qw'
    12GB      2hr avg in 'qw'
    20GB     12hr avg in 'qw'
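
To make the per-job metrics above concrete, here is a rough Python sketch of two of them: the ratio of CPU time to (wallclock time x slots), and average 'qw' time grouped by slots requested. It assumes the accounting data has already been parsed into dicts with numeric values; the key names (slots, cpu, ru_wallclock, submission_time, start_time) are modelled on the accounting field names and are an assumption, as are the 0.5 threshold and the sample records, which are made up purely for illustration.

```python
from collections import defaultdict

def cpu_efficiency(cpu_seconds, wallclock_seconds, slots):
    """CPU time used, divided by the CPU time the granted slots could deliver."""
    capacity = wallclock_seconds * slots
    return cpu_seconds / capacity if capacity > 0 else 0.0

def mean_wait_by_slots(jobs):
    """Average seconds between submission and start, grouped by slot count."""
    waits = defaultdict(list)
    for job in jobs:
        waits[job["slots"]].append(job["start_time"] - job["submission_time"])
    return {slots: sum(w) / len(w) for slots, w in sorted(waits.items())}

# Hypothetical, hand-written records (times in epoch-style seconds).
jobs = [
    {"slots": 1, "submission_time": 0, "start_time": 3600,
     "ru_wallclock": 1000, "cpu": 950.0},
    {"slots": 4, "submission_time": 0, "start_time": 43200,
     "ru_wallclock": 1000, "cpu": 1000.0},
    {"slots": 4, "submission_time": 100, "start_time": 7300,
     "ru_wallclock": 2000, "cpu": 7600.0},
]

for job in jobs:
    eff = cpu_efficiency(job["cpu"], job["ru_wallclock"], job["slots"])
    # Assumed threshold: a multi-slot job using under half its slots' capacity.
    suspect = job["slots"] > 1 and eff < 0.5
    note = "  <-- possibly single-threaded" if suspect else ""
    print(f"slots={job['slots']} efficiency={eff:.2f}{note}")

print(mean_wait_by_slots(jobs))
```

The same grouping would work for RAM requested instead of slots, provided the memory request can be recovered for each job; where that value would come from is not covered here.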
I think the only way to get this information would be to run 'qstat'
periodically, then capture & process that data. Any better suggestions, or
scripts that anyone can share?
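
For the periodic-sampling approach, the post-processing side could be as simple as the Python sketch below, which turns a series of utilization samples (percent of CPU slots or memory in use at each polling interval) into the "Utilization / Time" histogram described earlier. How the samples are collected is deliberately left out; the function name, the 10% bin width and the sample values are all assumptions for illustration.

```python
from collections import Counter

def utilization_histogram(samples, bin_width=10):
    """Map each bin's lower edge (percent) to the share of samples in it."""
    counts = Counter()
    for pct in samples:
        # Bin by lower edge; anything at or above 100% lands in the 100 bin.
        edge = min(int(pct // bin_width) * bin_width, 100)
        counts[edge] += 1
    total = len(samples)
    return {edge: 100.0 * n / total
            for edge, n in sorted(counts.items(), reverse=True)}

# Hypothetical samples, one per polling interval.
samples = [100, 95, 91, 90, 42]
hist = utilization_histogram(samples)

print("Utilization    Time")
for edge, share in hist.items():
    print(f"{edge:>4}%         {share:>4.0f}%")
```

Since every sample covers the same polling interval, the share of samples in a bin stands in for the share of time spent at that utilization level.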
Thanks,
Mark