Also check out xdmod:
   http://xdmod.sourceforge.net/

On Thu, Feb 25, 2016 at 09:51:51AM +0100, RDlab wrote:
Hello,

I would suggest that you take a look at S-GAE. It gathers data from qacct and 
displays information using eye-candy graphics for each user, queue, and the 
whole cluster. It shows process memory usage, averages, queue wait time, etc.

By the way, it is free software under GNU license and we are really happy with 
it :)

http://rdlab.cs.upc.edu/s-gae


Best regards,

Gabriel

--
RDlab (Campus Nord - UPC)  --  http://rdlab.cs.upc.edu
C/ Jordi Girona 1-3. Edifici Omega, Despatx 005
08034 Barcelona

Telf:   +34 93 413 78 20

On 24 Feb 2016, at 21:22, [email protected] wrote:

Is anyone monitoring cluster utilization with a higher-level view than
simply job (qacct) statistics and CPU-seconds used/available?

I'm running SoGE 8.1.6  on a cluster with ~70 nodes, ~1400 cores and
200~350K jobs/month and I'm seeking ways to understand the utilization &
resource constraints in our cluster overall.

The 'jobstats' script is fine for giving feedback to users, looking at
things like avg/high/low job runtime, wait time, etc., but it doesn't
give good information about overall cluster utilization.


I'd like to see these kinds of metrics on cluster use:

        histogram of CPU utilization, ie:
                Utilization     Time
                100%            5%
                 90%            20%

        histogram of overall memory use, ie:
                Utilization     Time
                100%            0%
                 90%            60%

        correlation between jobs waiting (CPUs idle) and available memory, as
        in:
                Jan 1   14:00 - 20:00
                        avg 4GB free/node
                        avg 50% CPU-slots used
                        avg 12GB RAM request for jobs in 'qw'
                                memory is the constraint: cluster is fully
                                utilized but CPUs are idle

                Jan 8   08:00 - 14:00
                        avg 32GB free/node
                        avg 98% CPU-slots used
                        avg 2GB RAM request for jobs in 'qw'
                                CPU is the constraint: cluster is fully
                                utilized but memory is unused


        number of jobs queued/waiting (excluding 'hold' jobs)

        number of CPUs requested vs [CPU time/wallclock time]
                (useful for detecting if users are requesting multiple
                cores in the 'threaded' PE but running single-threaded
                jobs)

        amount of memory used per job as a function of request, ie:
                requested       used avg
                =========       ========
                4GB             2.1GB
                12GB            9GB
                20GB            17GB

        average duration a job spends in 'qw' state

        duration of queue time as a function of number of CPUs requested, ie
                1CPU    1hr avg in 'qw'
                2CPU    2hr avg in 'qw'
                4CPU    12hr avg in 'qw'

        duration of queue time as a function of amount of RAM requested
                4GB     1hr avg in 'qw'
                12GB    2hr avg in 'qw'
                20GB    12hr avg in 'qw'
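
Two of the per-job metrics above (CPU time against requested slots, and
memory used against memory requested) could be derived from accounting
records roughly as follows. This is only a sketch: it assumes the records
have already been parsed into dicts, and the field names used here (slots,
cpu, ru_wallclock, mem_req_gb, maxvmem_gb) are illustrative choices for the
example, not a guaranteed accounting format.

```python
from collections import defaultdict


def cpu_efficiency(job):
    """CPU seconds used divided by (requested slots * wallclock seconds).

    A value near 1.0 means all requested cores were busy; a value near
    1/slots suggests a single-threaded job in a multi-slot request.
    """
    denom = job["slots"] * job["ru_wallclock"]
    return job["cpu"] / denom if denom > 0 else 0.0


def avg_mem_used_by_request(jobs):
    """Average peak memory (GB) grouped by requested memory (GB)."""
    totals = defaultdict(lambda: [0.0, 0])  # request -> [sum of peaks, job count]
    for job in jobs:
        entry = totals[job["mem_req_gb"]]
        entry[0] += job["maxvmem_gb"]
        entry[1] += 1
    return {req: s / n for req, (s, n) in sorted(totals.items())}


# Hypothetical sample records, made up for illustration only.
jobs = [
    {"slots": 4, "cpu": 3600.0, "ru_wallclock": 3600.0, "mem_req_gb": 4, "maxvmem_gb": 2.0},
    {"slots": 1, "cpu": 1800.0, "ru_wallclock": 2000.0, "mem_req_gb": 4, "maxvmem_gb": 3.0},
    {"slots": 2, "cpu": 7000.0, "ru_wallclock": 3600.0, "mem_req_gb": 12, "maxvmem_gb": 9.0},
]

if __name__ == "__main__":
    for job in jobs:
        print(job["slots"], round(cpu_efficiency(job), 2))
    print(avg_mem_used_by_request(jobs))
```

The first sample job requests 4 slots but only accumulates one core's worth
of CPU time, which is the pattern described above for single-threaded jobs
in the 'threaded' PE.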

I think that the only way to get this information would be to run 'qstat'
periodically, then capture & process that data. Any better suggestions or
scripts that anyone can share?
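
For the utilization histograms, periodic sampling of that kind could be
reduced to a "percent of time at each utilization level" table along these
lines. Again a sketch only: it assumes each sample is a cluster-wide
utilization percentage (0 to 100) taken at a fixed interval, and how those
samples are collected is left out. The function name and the 10% bin width
are arbitrary choices for the example.

```python
from collections import Counter


def utilization_histogram(samples, bin_width=10):
    """Return {bucket_percent: percent_of_samples}, highest bucket first.

    Each sample is floored to the nearest bin_width, so 95 falls into
    the 90 bucket. With equally spaced samples, the share of samples in
    a bucket approximates the share of time spent at that level.
    """
    if not samples:
        return {}
    counts = Counter()
    for pct in samples:
        pct = min(max(pct, 0), 100)  # clamp out-of-range readings
        bucket = int(pct // bin_width) * bin_width
        counts[bucket] += 1
    total = len(samples)
    return {b: 100.0 * n / total for b, n in sorted(counts.items(), reverse=True)}


if __name__ == "__main__":
    # Hypothetical samples, made up for illustration only.
    for bucket, share in utilization_histogram([100, 95, 90, 42, 47]).items():
        print(f"{bucket:>3}%  {share:.0f}%")
```

The same function works for memory utilization samples, since it only
depends on the values being percentages.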

Thanks,

Mark
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users


--
Jesse Becker (Contractor)