[Ganglia-developers] Moving all built-in metrics to metric modules...

Brad Nicholes Tue, 18 Dec 2007 14:45:17 -0800

   I just committed a rather substantial patch to Ganglia 3.1.0 trunk which 
will affect the way that gmond 3.1.x is deployed.  I am posting this to both 
the developer list and the general list so that all will be aware of the 
changes and why they are important.  The primary purpose of the patch was to 
remove all of the built-in metrics from the gmond binary and allow them to be 
built as loadable modules.  The following is a more detailed list of what has 
changed.  From a user perspective, gmond should continue to work as it has in 
the past, but going forward it will be much more flexible with regard to the 
core set of metrics.


* All built-in metrics have been removed from the gmond binary
  - A new set of core metric modules has been created that represents the same 
set of metrics that gmond has always gathered.  These new core modules are 
mod_cpu.so, mod_disk.so, mod_load.so, mod_mem.so, mod_net.so, mod_proc.so and 
mod_sys.so.  Each of these modules is basically a wrapper around the metric 
functions that exist in libmetrics.  Being wrappers, they still make the same 
metric function calls as have always been made.  And since libmetrics contains 
all of the platform specific metric code, the metric function calls made by the 
core modules will continue to do the right thing for all of the platforms that 
have been previously supported.  
  - There is also an extra module called core_metrics which contains the 
heartbeat, location and gexec metrics.  Even though this module could be 
dynamically loaded in the same manner as the others, it is always statically 
linked.  gmond would not be able to function properly without these metrics, 
so there is no real reason to allow them to be dynamically loaded.
  - Some additional configuration has been added to the gmond.conf file.  
Because the core metrics are now implemented as modules, gmond requires a 
module configuration block that instructs it to load each module.  A set of 
module blocks has been added to the default gmond.conf file.

* All metric specific metadata definitions have been removed from protocol.x
  - With the refactoring of the XDR data and removal of the built-in metrics, 
there is no longer any need for XDR to have intimate knowledge of the core 
metrics.  Therefore the metric structure array and enum have been removed and 
are now part of the core metric modules themselves.

* --enable-static-build statically links the core metric modules
  - Building gmond statically will statically link not only APR, expat and 
libconfuse but also all of the core metric modules into the gmond binary.  The 
result should be a gmond binary that looks and feels just like the old 3.0.x 
statically linked gmond binary.  The one exception is that a module statement 
is still required in the gmond.conf file.  The difference between the module 
configuration blocks for dynamically loaded modules and those for statically 
linked modules is whether or not a path to the .so is included.  The configure 
script and makefiles have been modified to detect --enable-static-build and 
build the default gmond.conf file appropriately.

* --enable-static-build + --enable-python statically links the python module
  - One of the downsides of building gmond 3.1.x statically was that doing so 
would disable all of the dynamically loadable module capability.  The reason 
for this is the need for both gmond and the pluggable modules to dynamically 
link with libapr1.  However, if both --enable-static-build and --enable-python 
are specified during configure, a gmond binary will be built with mod_python 
statically linked.  This provides gmond with the ability to continue to load 
and run python metric modules in the same manner as the non-static build.  In 
other words, even though statically linking gmond will disable pluggable C 
interface modules, python pluggable modules will continue to work.
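
As a rough illustration of what a python metric module looks like, here is a 
minimal sketch.  It assumes the descriptor-style interface described in the 
Ganglia 3.1 python module documentation (a metric_init() function returning a 
list of descriptor dictionaries, a call_back per metric, and metric_cleanup()); 
the exact keys may differ from what is in trunk, and the metric name, group and 
units below are hypothetical.

```python
import time

# Recorded once, when the interpreter loads this module.
_START_TIME = time.time()


def uptime_handler(name):
    """Collection callback; receives the metric name and returns its value."""
    return int(time.time() - _START_TIME)


def metric_init(params):
    """Return the list of metric descriptors this module provides.

    `params` would carry any values set for the module in gmond's
    configuration (assumption); this sketch ignores them.
    """
    descriptor = {
        "name": "example_module_uptime",   # hypothetical metric name
        "call_back": uptime_handler,       # function that collects the value
        "time_max": 90,
        "value_type": "uint",
        "units": "seconds",
        "slope": "both",
        "format": "%u",
        "description": "Seconds since this module was loaded",
        "groups": "example",               # hypothetical group designation
    }
    return [descriptor]


def metric_cleanup():
    """Called on shutdown; nothing to release in this sketch."""
    pass


if __name__ == "__main__":
    # Allows the module to be exercised outside of gmond.
    for d in metric_init({}):
        print(d["name"], d["call_back"](d["name"]), d["units"])
```

Running the file directly exercises the callback without gmond, which is a 
convenient way to debug a module before deploying it.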

* All metrics carry a group designation
  - Now that all metrics have been implemented as loadable modules, the metrics 
have also been assigned to groups.  The XML that is produced by gmond and 
gmetad will carry an <EXTRA_DATA GROUP="blah"> tag that defines which group 
each metric belongs to.  This will allow the web front end to be enhanced to 
filter metrics so that they can be displayed by group rather than all metric 
graphs appearing on the same page.
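
To show how a consumer such as the web front end might use the group 
designation, here is a minimal sketch that buckets metric names by group.  Only 
the EXTRA_DATA tag with its GROUP attribute comes from the description above; 
the sample document, the nesting of EXTRA_DATA inside METRIC, and the metric 
and group names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample; real gmond/gmetad output carries many more attributes.
SAMPLE_XML = """\
<HOST NAME="node1">
  <METRIC NAME="cpu_user" VAL="3.2">
    <EXTRA_DATA GROUP="cpu"/>
  </METRIC>
  <METRIC NAME="cpu_system" VAL="1.1">
    <EXTRA_DATA GROUP="cpu"/>
  </METRIC>
  <METRIC NAME="mem_free" VAL="123456">
    <EXTRA_DATA GROUP="memory"/>
  </METRIC>
  <METRIC NAME="custom_thing" VAL="7"/>
</HOST>
"""


def metrics_by_group(xml_text):
    """Map each group name to the metric names that declare it."""
    root = ET.fromstring(xml_text)
    groups = defaultdict(list)
    for metric in root.iter("METRIC"):
        extra = metric.find("EXTRA_DATA")
        group = extra.get("GROUP") if extra is not None else None
        # Metrics without a group designation land in a catch-all bucket.
        groups[group or "ungrouped"].append(metric.get("NAME"))
    return dict(groups)


if __name__ == "__main__":
    for group, names in sorted(metrics_by_group(SAMPLE_XML).items()):
        print(group, names)
```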


These changes should make gmond much more flexible when it comes to extending 
or replacing not only the core metrics but also new metrics.  I have attached 
the wish list that was compiled a couple of months ago which updates the items 
that I consider to be done.  As I mentioned at our meet-up a few weeks ago, we 
need to identify which of the remaining items must be addressed before shipping 
3.1.0 and get those completed.  I would like to see us ship a 3.1.0 release as 
soon as possible.  

Brad

Done
------------------
- C module interface as DSO
- mod_python Python module interface
- Dynamically link libraries like expat, apr, libconfuse
- Add TITLE attribute to the XDR data to communicate a human readable name
- Add a GROUP attribute to the XDR data
    This would allow metrics to declare the category that they belong to. The 
    category should be added at the metric definition level and not in the 
    .conf file.
- Reimplement the built in metrics as C interface modules
- A cleaner XDR encoding:
    The current encoding scheme embeds too much information about which metrics
    gmond collects.  The encoding scheme should treat all metrics the same: as
    just "a metric".  The encoding should not care if the metric is 
    metric_cpu_speed, metric_swap_total or a user-defined "gmetric" one.
- Flexible method of adding extra metric metadata.
    We could include extra metadata, not just "alias"/"title".  For example,
    some metrics have a natural minimum and maximum value.  Perhaps we could
    come up with an extensible way of encoding metric metadata so that future
    changes can be included without losing backwards compatibility.
- Re-organization of RPM packages (libganglia, gmond-python ?)


GMond To Do
------------------------
- Gmond module repository
- Implement a perl module interface
- Implement a PHP module interface
- Implement a Ruby module interface
- Metric packing:
    Simply that a UDP packet can contain multiple metrics (using the usual XDR
    stream decoding) up to the size of a UDP packet.  This would help reduce
    the overheads when sending many metric updates concurrently.  It also
    preserves the current gmond behaviour where it sends metric updates in
    a single UDP packet.
- Support for counters (metrics with +ve slope)
    This shouldn't require much work (from memory, make sure the slope-type
    information is preserved and patch gmetad to create RRD files with the
    correct options).  Currently Ganglia doesn't actually support custom
    counter metrics, which is an awkward limitation.
- gmond switching to a non-blocking IO model.
    If there's a large number of metric updates then gmond must process them
    "quickly" or they will be lost.  If this happens whilst gmond is sending XML
    data to gmetad there may be a delay, increasing the risk of metric
    update messages being lost.  Switching to a non-blocking IO model would
    allow gmond to respond preferentially to the incoming UDP messages.
-* Remove the 4T limit on ganglia metric results
-* Modify all byte count metrics to 8-byte ints
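
The metric packing item above could be sketched roughly as follows.  This is 
not Ganglia's wire format: each metric is reduced to a hypothetical (name, 
value) pair of XDR strings, and the size limit is an arbitrary parameter.  It 
only illustrates packing several XDR-encoded records into one datagram and 
reading them back with ordinary stream decoding.

```python
import struct


def xdr_string(text):
    """Encode a string XDR-style: 4-byte big-endian length, data, zero
    padding up to a multiple of four bytes."""
    data = text.encode("utf-8")
    padding = (4 - len(data) % 4) % 4
    return struct.pack(">I", len(data)) + data + b"\x00" * padding


def encode_metric(name, value):
    """Hypothetical record layout: metric name followed by its value."""
    return xdr_string(name) + xdr_string(value)


def pack_metrics(metrics, max_size=1400):
    """Greedily fill datagrams with whole records, never exceeding max_size."""
    datagrams = []
    current = b""
    for name, value in metrics:
        record = encode_metric(name, value)
        if len(record) > max_size:
            raise ValueError("metric %r does not fit in one datagram" % name)
        if current and len(current) + len(record) > max_size:
            datagrams.append(current)
            current = b""
        current += record
    if current:
        datagrams.append(current)
    return datagrams


def read_xdr_string(buf, offset):
    """Decode one XDR string at offset; return (text, next_offset)."""
    (length,) = struct.unpack_from(">I", buf, offset)
    start = offset + 4
    text = buf[start:start + length].decode("utf-8")
    padded = (length + 3) // 4 * 4
    return text, start + padded


def unpack_metrics(datagram):
    """Read records back until the datagram is exhausted."""
    metrics = []
    offset = 0
    while offset < len(datagram):
        name, offset = read_xdr_string(datagram, offset)
        value, offset = read_xdr_string(datagram, offset)
        metrics.append((name, value))
    return metrics


if __name__ == "__main__":
    sample = [("load_one", "0.42"), ("cpu_user", "3.2"), ("mem_free", "123456")]
    for dgram in pack_metrics(sample, max_size=40):
        print(len(dgram), unpack_metrics(dgram))
```

A datagram holding a single record decodes the same way, so a receiver written 
like this would also accept one-metric-per-packet senders.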

GMetad To Do
------------------------------
- Support for new RRDTool which allows graphs to have dynamic sizes
- Gilad's stacked graphs
- Changing the units of default metrics to their base
    For example, disk_free's base unit should be bytes, not GB (as rrdtool will
    automatically append G, M, K, etc.)
- Better support for bigger less frequent updates 
    one packet every 20 seconds per host for all data?
- Multi PB disk limit
- Better on disk RRD perf (tmpfs is an OK workaround)
-* Name RRD directories based on UUID generated by client gmond 
    hash of MAC address? something else? So that renaming hosts, updating DNS or
    hosts files don't result in history for the physical gmond client being
    lost.
- Integration of gexec/authd ?  
- Expand gstat nodelist parameter query options (i.e. return all hosts
with <10% iowait, etc.)
- Interface stats in bits?  Self awareness of interface capability for %
util stats for network.
- Something like a unique per-gmond instance identifier
    To help with multi-homing and DNS issues and so the IP address is no 
    longer the index key. There was discussion of this under the subject 
    "Overriding hostname" on the Ganglia-general list.
- Give some metrics priority and have them updated more frequently in their 
RRDs than others.
- Allow for some sort of in memory RRD (never written to disk) as an 
alternative storage for very extreme cases.
- Let the users manage different IO bound pools for their metrics
    For extreme cases one based on tmpfs. So that they can be tied correctly 
    to the right kind of storage IO capabilities for the frequency needed.
- Add more memory metrics 
    slab, buffers, dirty, writeback, cache_clean (= cached - 
    (dirty+writeback)), mapped, free
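
One way the "UUID from a hash of the MAC address" idea in the list above might 
work is sketched below.  It is only an illustration: the namespace, the choice 
of a version 5 (SHA-1, name-based) UUID, and the directory layout are all 
assumptions, not anything gmond or gmetad currently does.

```python
import uuid

# Hypothetical namespace so these identifiers cannot collide with other
# name-based UUIDs; any fixed UUID would do.
GMOND_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "gmond.example.invalid")


def normalize_mac(mac):
    """Reduce common MAC notations to twelve lowercase hex digits."""
    digits = mac.replace(":", "").replace("-", "").replace(".", "").lower()
    if len(digits) != 12 or any(c not in "0123456789abcdef" for c in digits):
        raise ValueError("not a 48-bit MAC address: %r" % mac)
    return digits


def host_uuid(mac):
    """Derive a stable identifier that survives hostname or DNS changes."""
    return uuid.uuid5(GMOND_NAMESPACE, normalize_mac(mac))


def rrd_directory(rrd_root, cluster, mac):
    """Hypothetical layout: <rrd_root>/<cluster>/<uuid> instead of hostname."""
    return "/".join([rrd_root, cluster, str(host_uuid(mac))])


if __name__ == "__main__":
    print(rrd_directory("/var/lib/ganglia/rrds", "mycluster", "00:1A:2B:3C:4D:5E"))
```

Because the identifier depends only on the MAC address, replacing a network 
card would still orphan the history, which may be why the item also asks 
"something else?".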

Web interface
-------------------------------
- Numerous custom graphs enhancements (Alex Balk, Timothy Witham, others)
- Web frontend face lift
- Mouse over result graphs
- Default cluster view uses text-only per host squares 
    loading 1700 little graphs chews too much browser
- Better icons.
    The current highly-compressed JPEG files for the icons look horrible!
    Line-art perhaps suffers worst from JPEG compression artifacts.  Could we
    not use either PNGs or (preferably) SVG?

- Add an option to allow switching to SVG in-line RRDTool graphs.
    This should be pretty easy to add as a config option.  I think support for
    SVG in current browsers is now "good enough".  A half-way modern version of
    RRDTool can generate SVG versions of the graphs, which should look much
    better.

- Have some standard way of describing custom graphs.
    There currently isn't a standard way of producing custom graphs; "custom"
    here means adding support for host-specific and cluster-specific graphs and
    also some framework for describing those custom graphs.  I have a
    solution that (at least) has the merit of both existing and working.
    Perhaps it isn't ideal, but the Ganglia web front-end should provide at
    least some standard hooks if not an actual framework.

- Have the option to switch off displaying all the single-metric graphs.
    If you have ~300 metrics, the little graphs at the page bottom are all but
    useless.  They slow down the loading of the page without adding much
    insight.  (I have a simple patch that allows a user to choose whether they
    want to see these graphs.)

- Fix the pie-chart-generating code.
    The current pie-chart code is a bit ugly and can plot things incorrectly
    under certain circumstances.  There must be some nicer graph plotting
    packages out there...
