I've been thinking about the standard metrics in ganglia and I've run
into a number of cases where the assumptions are either generally wrong
or overly Linux specific.  It would be good to think about ways to
address these issues.  There are three classes of metrics I am
concerned with: cpu speed, cpu counters, and memory usage.

CPU speed is probably the easiest one.  In FreeBSD 6.0 (and probably
5.5) users will be able to control the speed of their CPUs at runtime
either manually or via a monitoring daemon that increases CPU speed in
the face of load.  Because of that, the current model where speed
doesn't change no longer makes sense.  There's an argument for a static
report of the maximum CPU speed, but I think we should also report a
dynamic current speed since that is potentially interesting, particularly in
non-compute-bound clusters.
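
To make the static-plus-dynamic idea concrete, here is a rough Python
sketch.  It assumes the FreeBSD side would read the cpufreq sysctls
dev.cpu.0.freq (current MHz) and dev.cpu.0.freq_levels (space-separated
MHz/milliwatt pairs).  The metric names cpu_speed_max and cpu_speed_cur
are made up for illustration and are not existing ganglia metrics.

```python
def parse_freq_levels(levels):
    """Return the frequencies (MHz) from a 'MHz/mW MHz/mW ...' string."""
    freqs = []
    for pair in levels.split():
        mhz, _, _power = pair.partition("/")  # power may be -1 (unknown)
        freqs.append(int(mhz))
    return freqs


def cpu_speed_metrics(current_mhz, levels):
    """Build a static maximum and a dynamic current speed report.

    The two metric names are hypothetical, chosen for this sketch.
    """
    freqs = parse_freq_levels(levels)
    return {
        "cpu_speed_max": max(freqs),   # static: only needs reporting once
        "cpu_speed_cur": current_mhz,  # dynamic: polled like any other gauge
    }


# Hand-typed sample values.  On a real host they would presumably come from
# `sysctl -n dev.cpu.0.freq` and `sysctl -n dev.cpu.0.freq_levels`.
sample = cpu_speed_metrics(1200, "2400/88000 1800/60000 1200/35000")
print(sample)
```

Keeping the maximum as a separate value means existing graphs that
expect a constant speed could keep working while the current speed is
tracked alongside it.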

CPU counters are another issue.  Most systems support reporting user,
nice, system, interrupt, and idle usage, but ganglia also reports wio and
sintr.  These are interesting values, but they aren't always available.
Presumably, other systems support other counters that are meaningful
on those platforms.  It seems like what we need here is the ability to
say, "this set
of named values adds to 100% of this resource" with corresponding
support in both gmetad and the web frontend.  There might also need to
be a way to sort those values into groups to allow somewhat useful
summary graphs for mixed clusters.  Off the top of my head, grouping CPU
into user, system, and idle, possibly plus interrupt, would give us
pretty graphs for just about anything.
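
As a rough illustration of "this set of named values adds to 100%" plus
grouping, here is a small Python sketch.  Everything in it is an
assumption on my part: the CPU_GROUPS table, the counter name "intr",
the choice to fold wio into idle, and the function names.  None of this
is how gmond or gmetad work today.

```python
# Hypothetical mapping from per-OS counter names to portable groups.
# Folding "wio" into "idle" is a judgment call made for this sketch.
CPU_GROUPS = {
    "user": "user", "nice": "user",
    "system": "system",
    "intr": "interrupt", "sintr": "interrupt",
    "idle": "idle", "wio": "idle",
}


def to_percentages(ticks):
    """Scale a set of named counters so that they add up to 100%."""
    total = sum(ticks.values())
    if total == 0:
        return {name: 0.0 for name in ticks}
    return {name: 100.0 * count / total for name, count in ticks.items()}


def summarize(ticks, groups=CPU_GROUPS):
    """Collapse whatever counters a host reports into the shared groups."""
    summary = {}
    for name, pct in to_percentages(ticks).items():
        group = groups.get(name, "other")  # unknown counters stay visible
        summary[group] = summary.get(group, 0.0) + pct
    return summary


# Two hosts reporting different counter sets still yield comparable groups.
host_a = {"user": 50, "nice": 10, "system": 20, "intr": 5, "idle": 15}
host_b = {"user": 30, "system": 10, "wio": 20, "sintr": 5, "idle": 35}
print(summarize(host_a))
print(summarize(host_b))
```

The per-host detail would still be available for single-host graphs;
only the cluster summary needs the grouped view.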

Memory is another issue.  In FreeBSD, we can account for all memory, but
the traditional used+free model just doesn't work.  On any OS with a
decent VM, memory in the steady state is fully occupied even if active
processes aren't using it.  This is because free memory is wasted
memory: it could be put to use holding things like the contents of
files that were accessed at some point in the past, since they might be
accessed again.  If they aren't, freeing the memory when something more
important comes along costs so little that the possible speedup is more
than worth it.  Most OSes do this; some just seem to lie to their users
about it.  On FreeBSD, most memory is divided into active, inactive,
wired, cache, buffer, and free.  On a system with a few hundred MB of
RAM, 1-5MB is a typical value for free memory even with nothing running.
To add to the incompatibility, we don't even really try to report shared
memory.  I could probably write some code to do this for SysV and POSIX
shm* memory, but I doubt there would be a sane way to report mmapped
files being used as shared memory.  As with CPU, I think we probably
need a tiered reporting system.

Any thoughts on these issues?

-- Brooks

-- 
Any statement of the form "X is the one, true Y" is FALSE.
PGP fingerprint 655D 519C 26A7 82E7 2529  9BF0 5D8E 8BE9 F238 1AD4
