Re: [Ganglia-developers] gaps in gmetad graphs

Jason A. Smith Sat, 15 Mar 2003 08:56:25 -0800

Thanks for the explanation.  I have read some of the rrdtool
documentation before, but every time I read it again I learn something
new and understand it a little better.


One note: remember that the interval is no longer guaranteed to be 15
seconds, since it can be set in the gmetad config file.  That raises a
question.  What step interval is used in the summary plots?  What
happens if I have multiple data sources with different polling
intervals?  Will this mess up the summary updates?  Based on Steven's
suggestion, it seems that every time you have a gmetad query another
gmetad with its grid authority turned off, you will have to at least
double its polling interval unless the rrd writing is improved as he
said.

Increasing the heartbeat is one simple way to improve the gap problem,
but as Steven said before, better rrd update error recovery is needed
too, at least for the minimum one second step interval errors caused by
an rrd update that has the same cluster timestamp as the previous
update.  The rrd creation also has to be fixed to set a --start time
before the timestamp that will be used in the first update (rrdtool
does not accept data timed at or before the start time).
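
A rough sketch of those two checks in C.  The helper names are made up
for illustration and are not in gmetad; the only rrdtool behaviour
assumed is that an update must be stamped later than the previous one,
and later than the start time of a new RRD.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical helper: pick a --start value for rrdcreate that is
 * strictly before the timestamp of the first update, so that the
 * first update is not rejected. */
time_t rrd_start_before(time_t first_update)
{
    assert(first_update > 0);   /* seconds since the epoch */
    return first_update - 1;
}

/* Hypothetical helper: true if an update stamped new_ts can be written
 * after one stamped last_ts.  An equal or older timestamp should be
 * skipped (or logged) by the caller instead of being handed to the
 * rrd update call. */
bool rrd_timestamp_ok(time_t last_ts, time_t new_ts)
{
    return new_ts > last_ts;
}
```

The caller would remember the last cluster timestamp written to each
RRD and only update when rrd_timestamp_ok() returns true.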

I also noticed another minor problem.  If you try to use a grid or
cluster name that has a slash in it, then the mkdir system call will
obviously fail since it assumes the slash is a path separator.  Maybe a
quick check for a slash in the name before mkdir is called should be
done, or at least it should be documented somewhere, maybe in the
example config file.
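
Something like the following could run before mkdir is called.  This is
only a sketch: the function name is hypothetical, and rejecting empty
names, "." and ".." goes beyond the slash case described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper, not part of gmetad: true if a grid or cluster
 * name can be used as a single directory name under the rrd root. */
bool name_is_single_dir(const char *name)
{
    assert(name != NULL);

    if (name[0] == '\0')
        return false;       /* empty name */
    if (strchr(name, '/') != NULL)
        return false;       /* would be read as a path separator */
    if (strcmp(name, ".") == 0 || strcmp(name, "..") == 0)
        return false;       /* would not create a new directory */
    return true;
}
```

A name that fails the check could be rejected with a clear error when
the config file is parsed, rather than failing later in mkdir.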

~Jason


On Fri, 2003-03-14 at 20:15, matt massie wrote:
> the gaps in the gmetad graphs are caused by *UNKNOWN* data.  let's walk
> through this and figure out what is going on....
>
> here is the relevant gmetad code...in ./gmetad/rrd_helpers.c RRD_create().
> 
> ----------------- begin code snip -------------------------
> /* Our heartbeat is twice the step interval which is always 15. */
>    heartbeat = 2*step;
> 
>    argv[argc++] = "dummy";
>    argv[argc++] = rrd;
>    argv[argc++] = "--step";
>    sprintf(s, "%u", step);
>    argv[argc++] = s;
>    sprintf(sum,"DS:sum:GAUGE:%d:U:U", heartbeat);
>    argv[argc++] = sum;
>    if (summary) {
>       sprintf(num,"DS:num:GAUGE:%d:U:U", heartbeat);
>       argv[argc++] = num;
>    }
>    argv[argc++] = "RRA:AVERAGE:0.5:1:240";
>    argv[argc++] = "RRA:AVERAGE:0.5:24:240";
>    argv[argc++] = "RRA:AVERAGE:0.5:168:240";
>    argv[argc++] = "RRA:AVERAGE:0.5:672:240";
>    argv[argc++] = "RRA:AVERAGE:0.5:5760:370";
> ------------------ end code snip --------------------
> 
> for every RRDb the step is 15 and the heartbeat is 30.  for non-summary
> databases we have one DS (data source) called "sum".  summary databases
> also have a second DS called "num" which holds the number of hosts in the
> summation.  both the "num" and "sum" DS have a 30 second heartbeat and the
> max and min values are set to "U" meaning.. no limits are set.
> 
> there are 5 RRAs (round-robin archives).  each RRA uses the AVERAGE
> consolidation function and has a 0.5 xff (The xfiles factor defines
> what part of a consolidation interval may be made up from *UNKNOWN* data
> while the consolidated value is still regarded as known).  in short, if
> more than half of the values over a consolidation interval are UNKNOWN
> then the whole consolidated value is marked as *UNKNOWN*.
> 
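
To put numbers on the five RRAs and the xff: each row of an RRA covers
step * (PDPs per row) seconds, so with a 15 second step the archives
above span roughly 1 hour, 1 day, 7 days, 28 days and 370 days.  A small
C sketch of that arithmetic follows.  The function names are made up
for illustration and are not in gmetad or rrdtool, and the xff test
assumes the reading of the rrdcreate text quoted above (a row stays
known while no more than the xff fraction of its PDPs is unknown).

```c
#include <assert.h>
#include <stdbool.h>

/* Seconds covered by one RRA: each row consolidates pdp_per_row
 * primary data points of `step` seconds, and there are `rows` rows. */
unsigned long rra_span_seconds(unsigned long step,
                               unsigned long pdp_per_row,
                               unsigned long rows)
{
    assert(step > 0 && pdp_per_row > 0);
    return step * pdp_per_row * rows;
}

/* True if a consolidated row is *UNKNOWN*, i.e. the number of unknown
 * PDPs in the row is larger than xff times the PDPs per row. */
bool cdp_is_unknown(unsigned long unknown_pdps,
                    unsigned long pdp_per_row, double xff)
{
    assert(pdp_per_row > 0 && unknown_pdps <= pdp_per_row);
    return (double)unknown_pdps > xff * (double)pdp_per_row;
}
```

For "RRA:AVERAGE:0.5:24:240" this gives 15 * 24 * 240 = 86400 seconds,
and a row with 12 of its 24 PDPs unknown is still known while a row
with 13 unknown is not.
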
> here is something else to think about... and i'll comment more 
> afterwards...
> -------------------------------------------------------------------------
> http://people.ee.ethz.ch/~oetiker/webtools/rrdtool/manual/rrdcreate.html
> -------------------------------------------------------------------------
> Here is an explanation by Don Baarda on the inner workings of rrdtool. It 
> may help you to sort out why all this *UNKNOWN* data is popping up in your 
> databases:
> 
> RRD gets fed samples at arbitrary times. From these it builds Primary Data 
> Points (PDPs) at exact times every ``step'' interval. The PDPs are then 
> accumulated into RRAs.
> 
> The ``heartbeat'' defines the maximum acceptable interval between samples. 
> If the interval between samples is less than ``heartbeat'', then an 
> average rate is calculated and applied for that interval. If the interval 
> between samples is longer than ``heartbeat'', then that entire interval is 
> considered ``unknown''. Note that there are other things that can make a 
> sample interval ``unknown'', such as the rate exceeding limits, or even an 
> ``unknown'' input sample.
> 
> The known rates during a PDP's ``step'' interval are used to calculate an 
> average rate for that PDP. Also, if the total ``unknown'' time during the 
> ``step'' interval exceeds the ``heartbeat'', the entire PDP is marked as 
> ``unknown''. This means that a mixture of known and ``unknown'' sample 
> time in a single PDP ``step'' may or may not add up to enough ``unknown'' 
> time to exceed ``heartbeat'' and hence mark the whole PDP ``unknown''. So 
> ``heartbeat'' is not only the maximum acceptable interval between samples, 
> but also the maximum acceptable amount of ``unknown'' time per PDP 
> (obviously this is only significant if you have ``heartbeat'' less than 
> ``step'').
> 
> The ``heartbeat'' can be short (unusual) or long (typical) relative to the 
> ``step'' interval between PDPs. A short ``heartbeat'' means you require 
> multiple samples per PDP, and if you don't get them mark the PDP unknown. 
> A long heartbeat can span multiple ``steps'', which means it is acceptable 
> to have multiple PDPs calculated from a single sample. An extreme example 
> of this might be a ``step'' of 5mins and a ``heartbeat'' of one day, in 
> which case a single sample every day will result in all the PDPs for that 
> entire day period being set to the same average rate. 
> -- Don Baarda <[EMAIL PROTECTED]>
> ----------------------------- end great info --------------------------
> 
> ok.. wow... let's try to simplify this..
> 
> first.. everything in rrdland is a simple timestamp/value pair.  the
> primary data points (PDPs) are "snapped" to the specified "step" interval 
> (even if it's not exact)... 
> 
> for example...
> 
> 00:00 insert value 5
> 00:20 insert value 10
> 00:35 insert value 7
> 00:45 insert value 9
> 00:60 insert value 10
> 
> here is what rrd returns... (i actually ran this using rrdtool, btw)...
> 
> 00:00 5
> 00:15 9.333333333333333
> 00:30 8.4
> 00:45 8.066666666666666
> 00:60 9.866666666666666
> 
> sooo.... at 15 second intervals rrdtool interpolates... it knows at 00:00
> the value is 5 and at 00:20 the value is 10.. on and on and on...  it is
> interpolating at each step along the way. 
> 
> wow!  that gives me a great idea of how to make gmetad MUCH less disk i/o
> intensive... (have a HUGE heartbeat and use explicit *UNKNOWN* values for
> dead data sources and only write significant CHANGES in value...later...
> g3)... focus.. focus...
> 
> the heartbeat is currently set way too small.  since it is only 2x the
> step, if any data source takes more than 30 seconds to collect, parse
> and write (that'll happen!).. then its data gets marked as *UNKNOWN*.
> 
> here is a test to see if we can reduce the gaps in your images... 
> 
> 1. (re)move your old RRDbs in /var/lib/ganglia/rrds 
>    (i know.. that sux.. sorry)
> 
> 2. change line 79 in ./gmetad/rrd_helpers.c from
> 
>       /* Our heartbeat is twice the step interval. */
>       heartbeat = 2*step;
> 
>    to be
> 
>       /* Our heartbeat interval is eight times the step interval */
>       heartbeat = 8*step;
> 
> 3. recompile gmetad and give it a try.
> 
> this new gmetad will likely have far fewer gaps but the only catch is
> this.  if a data source goes offline, you will not see the gap in the
> graph (telling you the data source is dead) until eight steps (2 minutes)
> have passed.  i think that is a small price to pay.
> 
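
The trade-off can be sketched in C as below.  The function names are
hypothetical and only mirror the "heartbeat = N*step" line from
RRD_create(); the comparison follows the heartbeat description quoted
above (an interval longer than the heartbeat is unknown).

```c
#include <assert.h>
#include <stdbool.h>

/* Heartbeat as a multiple of the step, as in RRD_create(). */
unsigned int heartbeat_seconds(unsigned int step, unsigned int factor)
{
    assert(step > 0 && factor > 0);
    return step * factor;
}

/* True if the time between two updates is longer than the heartbeat,
 * in which case that whole interval is treated as *UNKNOWN*. */
bool interval_is_unknown(unsigned int seconds_between_updates,
                         unsigned int heartbeat)
{
    return seconds_between_updates > heartbeat;
}
```

With a 15 second step, a factor of 2 gives a 30 second heartbeat and a
factor of 8 gives 120 seconds.  A 45 second pause between updates
leaves a gap in the first case and none in the second, but a dead
source also takes 120 seconds to show up as a gap.
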
> i'm sorry that you haven't heard much from me lately... my time is being
> consumed by writing ganglia 3 and doing talks (about ganglia).  i don't
> want to over-promise anything so mum's the word but the current limitations
> of gmetad (v2) will disappear in v3.  i hope this small hack helps.
> 
> -matt
> 
> ps. have a great weekend guys!