Is anyone monitoring cluster utilization with a higher-level view than
simply job (qacct) statistics and CPU-seconds used/available?

I'm running SoGE 8.1.6 on a cluster with ~70 nodes, ~1400 cores and
200K-350K jobs/month, and I'm looking for ways to understand overall
utilization and resource constraints in our cluster.

The 'jobstats' script is fine for giving feedback to users, looking at
things like avg/high/low job runtime, wait time, etc., but it doesn't
give good information about overall cluster utilization.


I'd like to see these kinds of metrics on cluster use:

        histogram of CPU utilization, e.g.:
                Utilization     Time
                100%            5%
                 90%            20%

        histogram of overall memory use, e.g.:
                Utilization     Time
                100%            0%
                 90%            60%

        correlation between jobs waiting (CPUs idle) and available memory, as
        in:
                Jan 1   14:00 - 20:00
                        avg 4GB free/node
                        avg 50% CPU-slots used
                        avg 12GB RAM request for jobs in 'qw'
                                memory is the constraint: the cluster is
                                fully utilized but CPUs are idle

                Jan 8   08:00 - 14:00
                        avg 32GB free/node
                        avg 98% CPU-slots used
                        avg 2GB RAM request for jobs in 'qw'
                                CPU is the constraint: the cluster is
                                fully utilized but memory is unused


        number of jobs queued/waiting (excluding 'hold' jobs)

        number of CPUs requested vs [CPU time/wallclock time]
                (useful for detecting if users are requesting multiple
                cores in the 'threaded' PE but running single-threaded
                jobs)

        amount of memory used per job as a function of request, e.g.:
                requested       used avg
                =========       ========
                4GB             2.1GB
                12GB            9GB
                20GB            17GB

        average duration job spends in 'qw' state

        duration of queue time as a function of number of CPUs requested, e.g.:
                1CPU    1hr avg in 'qw'
                2CPU    2hr avg in 'qw'
                4CPU    12hr avg in 'qw'

        duration of queue time as a function of amount of RAM requested
                4GB     1hr avg in 'qw'
                12GB    2hr avg in 'qw'
                20GB    12hr avg in 'qw'
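
Two of the per-job metrics above (CPU time vs. slots requested, and queue
time by number of CPUs) can be sketched roughly as below. This is only a
sketch under assumptions: it expects job records already parsed into dicts,
with keys named after fields I believe appear in the accounting data
(slots, cpu, ru_wallclock, submission_time, start_time). The parsing step
is left out, and the sample records are made up for illustration.

```python
from collections import defaultdict


def cpu_efficiency(job):
    """CPU seconds used per slot-second reserved.

    A value near 1.0 means every requested slot was busy; a value near
    1/slots suggests a single-threaded job holding several slots.
    """
    reserved = job["slots"] * job["ru_wallclock"]
    return job["cpu"] / reserved if reserved > 0 else None


def mean_wait_by_slots(jobs):
    """Average seconds between submission and start, grouped by slots.

    Note: this counts time spent on hold as waiting, since the record
    alone does not distinguish the two.
    """
    waits = defaultdict(list)
    for job in jobs:
        waits[job["slots"]].append(job["start_time"] - job["submission_time"])
    return {slots: sum(w) / len(w) for slots, w in sorted(waits.items())}


# Made-up records, for illustration only (times in seconds).
jobs = [
    {"slots": 1, "ru_wallclock": 100, "cpu": 95.0,
     "submission_time": 0, "start_time": 60},
    {"slots": 4, "ru_wallclock": 200, "cpu": 190.0,
     "submission_time": 0, "start_time": 600},
    {"slots": 4, "ru_wallclock": 100, "cpu": 380.0,
     "submission_time": 100, "start_time": 300},
]

for job in jobs:
    print(job["slots"], cpu_efficiency(job))
print(mean_wait_by_slots(jobs))
```

The same grouping would work for queue time by RAM requested, given a
record field holding the memory request.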

I think that the only way to get this information would be to run 'qstats'
periodically, then capture and process that data. Does anyone have better
suggestions, or scripts they can share?
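
For the periodic-sampling approach, the histogram step might look something
like the sketch below. It assumes the samples have already been collected at
a fixed interval as (used_slots, total_slots) pairs; how they are collected
and parsed is not shown, and the sample numbers are invented. With evenly
spaced samples, the share of samples in a bucket approximates the share of
time spent at that utilization.

```python
from collections import Counter


def utilization_histogram(samples, bucket=10):
    """Fraction of samples in each utilization bucket.

    samples: iterable of (used, total) pairs taken at a fixed interval.
    Buckets are labelled by their lower bound in percent (90 means 90-99%).
    """
    samples = list(samples)
    if not samples:
        return {}
    counts = Counter()
    for used, total in samples:
        pct = 100 * used / total
        counts[min(100, int(pct // bucket) * bucket)] += 1
    n = len(samples)
    return {b: counts[b] / n for b in sorted(counts, reverse=True)}


# Invented slot counts for a 1400-core cluster, for illustration only.
samples = [(1400, 1400), (1300, 1400), (1290, 1400), (700, 1400)]

for pct, share in utilization_histogram(samples).items():
    print(f"{pct:3d}%  {share:.0%}")
```

The same function works for memory if the pairs are (used, total) memory
per sample instead of slots.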

Thanks,

Mark