I've been thinking about the standard metrics in ganglia and I've run into a number of cases where the assumptions are either generally wrong or overly Linux-specific. It would be good to think about ways to address these issues. There are three classes of metrics I am concerned with: CPU speed, CPU counters, and memory usage.

CPU speed is probably the easiest one. In FreeBSD 6.0 (and probably 5.5) users will be able to control the speed of their CPUs at runtime, either manually or via a monitoring daemon that increases CPU speed in the face of load. Because of that, the current model where speed doesn't change no longer makes sense. There's an argument for a static report of the maximum CPU speed, but I think we should also report a dynamic current speed since that is potentially interesting, particularly in non-compute-bound clusters.

CPU counters are another issue. Most systems support reporting user, nice, system, interrupt, and idle usage, but ganglia also reports wio and sintr. These are interesting values, but they aren't always available. Presumably, other systems support other counters that are meaningful there. It seems like what we need here is the ability to say, "this set of named values adds to 100% of this resource" with corresponding support in both gmetad and the web frontend. There might also need to be a way to sort those values into groups to allow somewhat useful summary graphs for mixed clusters. Off the top of my head, grouping CPU into user, system, and idle, possibly plus interrupt, would give us pretty graphs for just about anything.

Memory is another issue. In FreeBSD, we can account for all memory, but the traditional used+free model just doesn't work. On any OS with a decent VM, the steady state of memory is occupied even if active processes aren't using it. This is because free memory is wasted memory. It could be put to use holding things like the contents of files that were accessed at some point in the past, since they might be accessed again. If they aren't, freeing the memory when something more important comes along is close enough to free that the possible speedup is more than worth it. Most OSes do this, some just seem to lie to their users about it. On FreeBSD, most memory is divided into active, inactive, wired, cache, buffer, and free.
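
Stepping back to the CPU counters for a moment, here is a rough sketch of what "this set of named values adds to 100% of this resource" plus grouping might look like. It is only an illustration, not a proposal for the actual gmond or gmetad code. The names CPU_GROUPS, summarize, host_a, and host_b are made up, the counter names are shortened, and where nice, wio, and sintr land is purely my assumption.

```python
# Hypothetical mapping from raw counter names to summary groups.
# The group names follow the user/system/idle/interrupt idea above;
# the placement of nice, wio, and sintr is an assumption.
CPU_GROUPS = {
    "user": "user",
    "nice": "user",
    "system": "system",
    "intr": "interrupt",
    "sintr": "interrupt",
    "wio": "idle",
    "idle": "idle",
}


def summarize(counters, groups):
    """Fold raw tick counts into group percentages that sum to 100."""
    total = sum(counters.values())
    if total <= 0:
        raise ValueError("counters must have a positive total")
    summary = {}
    for name, ticks in counters.items():
        if name not in groups:
            raise KeyError(f"no group defined for counter {name!r}")
        group = groups[name]
        summary[group] = summary.get(group, 0.0) + 100.0 * ticks / total
    return summary


# A host that reports only the five common counters...
host_a = {"user": 30, "nice": 5, "system": 10, "intr": 5, "idle": 50}
# ...and one that also reports wio and sintr.
host_b = {"user": 20, "nice": 0, "system": 10, "intr": 2,
          "sintr": 3, "wio": 15, "idle": 50}

for host in (host_a, host_b):
    print(summarize(host, CPU_GROUPS))
```

Both hosts end up with the same four group names, so a summary graph for a mixed cluster only has to know about the groups and not about every counter each OS happens to export. The same structure might carry the FreeBSD memory categories listed above, though I haven't tried to decide which group each one belongs in.
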
On a system with a few hundred MB of RAM, 1-5MB is a typical value for free memory even with nothing running.

To add to the incompatibility, we don't even really try to report shared memory. I could probably write some code to do this for SysV and POSIX shm* memory, but I doubt there would be a sane way to report mmapped files being used as shared memory. As with CPU, I think we probably need a tiered reporting system.

Any thoughts on these issues?

-- Brooks

-- 
Any statement of the form "X is the one, true Y" is FALSE.
PGP fingerprint 655D 519C 26A7 82E7 2529 9BF0 5D8E 8BE9 F238 1AD4