guys-

i didn't really touch on how we are going to process the collected
values in the new ganglia architecture.

one of the glaring problems with ganglia 2.x is that there is no
built-in trigger/alert mechanism.  many people have built their own by
parsing the XML, but alerting was never a real part of ganglia.

that is going to change.

i explained in my last message how modules will sync data to an
in-memory filesystem but didn't give many details (saying only that a
sync function simply creates a monitor data tree).  for example,

/load/one
/load/five
/load/fifteen

or

/processes/10232/
/processes/10222/

etc.

what ganglia needs to do after the data is quantified is QUALIFY it. 
that is what was missing before.

when we load a module and start a process of syncing data, ganglia keeps
the current and previous monitoring directory trees.  if you remember,
the monitoring tree is made up of directories, content and data.  

  directory ("cpu")
     |
     + content ("user")
         |
         + data ("45.2")

once the data is synced, it needs to be qualified.

the first qualification is whether the difference between the latest
monitoring tree values and the previous monitoring tree values is
significant (is a load of "1.34" significantly different than "1.32"?).
this is a relative qualification.

when you define a module, you will specify a relative qualification
function (such as a difference of 1 or strcmp() for string values).
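
as a sketch, a relative qualification function could just be a
callback over the old and new values.  the signature below is an
assumption on my part, not the final module API:

#include <stdlib.h>
#include <string.h>
#include <math.h>

/* hypothetical callback: return nonzero if the change is significant */
typedef int (*gm_rel_qual)(const char *old_val, const char *new_val);

/* numeric metrics: significant only if the values differ by >= 1.0,
   so a load going from "1.32" to "1.34" is NOT significant */
static int load_rel_qual(const char *old_val, const char *new_val)
{
   return fabs(atof(new_val) - atof(old_val)) >= 1.0;
}

/* string metrics: any change at all is significant */
static int string_rel_qual(const char *old_val, const char *new_val)
{
   return strcmp(old_val, new_val) != 0;
}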

there are also absolute qualifications.

when ganglia parses the old and new monitoring trees it will quickly
determine values that are NEW and DELETED.  for example,

if the old tree had

/processes/2034
/processes/2021

and the new tree has

/processes/2222
/processes/2021

we have one NEW entry (/processes/2222) and one DELETED entry
(/processes/2034) in the monitoring tree.
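
finding NEW and DELETED entries is just a simultaneous walk of the
two trees.  assuming siblings are kept sorted by name (and reusing
the hypothetical gm_node type from above), one level of the
comparison might look like:

#include <string.h>

/* sketch: diff one directory level of the old and new trees.
   recursing into children and firing the actual events is omitted. */
static void diff_level(gm_node *old_n, gm_node *new_n)
{
   while (old_n || new_n) {
      int c = !old_n ?  1        /* old exhausted: rest of new is NEW */
            : !new_n ? -1        /* new exhausted: rest of old is DELETED */
            : strcmp(old_n->name, new_n->name);
      if (c < 0) {
         /* in old tree only -> DELETED (e.g. /processes/2034) */
         old_n = old_n->next;
      } else if (c > 0) {
         /* in new tree only -> NEW (e.g. /processes/2222) */
         new_n = new_n->next;
      } else {
         /* in both -> hand the pair to the relative qualification */
         old_n = old_n->next;
         new_n = new_n->next;
      }
   }
}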

ganglia also needs to qualify values into NORMAL, WARNING and ALERT
based on absolute value thresholds (if /cpu/system is > 80% we have a
WARNING; if /cpu/system is > 90% we have an ALERT).  these
qualifications don't depend on the last value collected.
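
a sketch of that absolute qualification, with the thresholds passed
in (where the per-metric thresholds get configured is an open
question):

#include <stdlib.h>

typedef enum { GM_NORMAL, GM_WARNING, GM_ALERT } gm_abs_state;

/* sketch: abs_qual("92.3", 80.0, 90.0) for /cpu/system -> GM_ALERT */
static gm_abs_state abs_qual(const char *val, double warn, double alert)
{
   double v = atof(val);
   if (v > alert) return GM_ALERT;
   if (v > warn)  return GM_WARNING;
   return GM_NORMAL;
}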

using this architecture we can respond to the following monitoring
events:

- new entry created
- old entry deleted
- entry value is significantly different than last entry value
- entry value is in/out of a normal operating range.

this would allow an rrd module (rrd_mod) to be laid out like so:

if ( new entry )
{
   create round-robin database
}
if ( entry value is significantly different )
{
   save value to round-robin database
}
if ( entry is deleted )
{
   write NaN to round-robin database
}
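
concretely, rrd_mod could sit right on top of librrd.  rrd_create()
and rrd_update() are the real librrd entry points; everything else
here (file layout, DS/RRA arguments, the event hooks themselves) is
just a sketch:

#include <stdio.h>
#include <rrd.h>

/* sketch: NEW entry -> create a round-robin database for it */
static void on_new_entry(const char *metric)
{
   char path[256], *argv[4];
   snprintf(path, sizeof(path), "%s.rrd", metric);
   argv[0] = "create";
   argv[1] = path;
   argv[2] = "DS:sum:GAUGE:120:U:U";   /* example data source */
   argv[3] = "RRA:AVERAGE:0.5:1:240";  /* example archive */
   rrd_create(4, argv);
}

/* sketch: significant change -> save the value ("N" means now);
   a DELETED entry would do the same with "N:NaN" */
static void on_significant_change(const char *metric, const char *val)
{
   char path[256], update[64], *argv[3];
   snprintf(path, sizeof(path), "%s.rrd", metric);
   snprintf(update, sizeof(update), "N:%s", val);
   argv[0] = "update";
   argv[1] = path;
   argv[2] = update;
   rrd_update(3, argv);
}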

these qualifying modules don't just have to write data locally but could
also be used to make sure the data is synced to a group "leader" host
(e.g. a master for a cluster).

if ( entry value is significantly different )
{
   write entry to master over TCP/UDP/whatever
}
if ( entry is deleted )
{
   notify master this value is not being monitored anymore
}
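
the wire details are wide open, but the significant-change case over
UDP could be as simple as the sketch below (the "path=value" format
and the pre-built socket are placeholders, not a proposal):

#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* sketch: push one qualified value to the cluster master over UDP */
static void sync_to_master(int sock, struct sockaddr_in *master,
                           const char *path, const char *val)
{
   char buf[512];
   int len = snprintf(buf, sizeof(buf), "%s=%s", path, val);
   sendto(sock, buf, len, 0, (struct sockaddr *)master, sizeof(*master));
}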

let me know if you see any flaws in the logic of this approach.

i'm having good luck transforming our old metric library into a more
general purpose beast to be plugged into the 3.0.0 architecture.

more to come in the near future.

-matt

-- 
PGP fingerprint 'A7C2 3C2F 8445 AD3C 135E F40B 242A 5984 ACBC 91D3'

   They that can give up essential liberty to obtain a little 
      temporary safety deserve neither liberty nor safety. 
  --Benjamin Franklin, Historical Review of Pennsylvania, 1759
