Re: [Ganglia-developers] Problems with the latest ganglia.

Jason A. Smith Fri, 14 Mar 2003 11:30:22 -0800

I do agree with you about making sure that the second gmetad has a
polling interval set to be the same or greater than the interval in the
gmetad that it is getting data from.  What is a safe value to use here
anyway?  This does help, but it does not eliminate the problem.  It
looks like I still get frequent gaps, just not as many RRD_update
(minimum one second step) errors.  I still get these errors sometimes,
even with the second gmetad's interval set to 2-3 times what the first
one is using.  It also does not solve the problem I get when first
starting gmetad with an empty rrds directory.  It polls the gmetad data
source for the
first time, then tries to create the first rrd database, and I get the
following errors:


$ /usr/sbin/gmetad -d99
Going to run as user nobody
Sources are ...
Source: [BNL ATLAS] has 1 sources
        130.199.207.251
listening on port 8651
Data thread 3076 is monitoring [BNL ATLAS] data source
        130.199.207.251
Created rrd /var/lib/ganglia/rrds/ATLAS Linux Cluster/acas040.usatlas.bnl.gov/load_one.rrd
RRD_update: illegal attempt to update using time 1047668352 when last update time is 1047668354 (minimum one second step)
Writing Summary data for source ATLAS Linux Cluster
Created rrd /var/lib/ganglia/rrds/ATLAS Linux Cluster/__SummaryInfo__/load_one.rrd
RRD_update: illegal attempt to update using time 1047668352 when last update time is 1047668354 (minimum one second step)
data_thread() couldn't parse the XML and data to RRD for [BNL ATLAS]

I think I know what is causing this problem now, after looking at the
gmetad sources and the rrd documentation.  With the timestamp patch, the
rrd update may be using a timestamp that is several tens of seconds in
the past, whereas before it was using the current time.  Also, when the
rrds are created, no --start time is specified, so the default start is
only 10 seconds in the past.  I think the above errors happen whenever
the cluster time is more than 10 seconds old, which will happen
frequently when you try to have one gmetad poll another gmetad.
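
To make the arithmetic concrete, here is a minimal sketch in Python of the one rule involved.  It is a toy model, not the rrdtool API and not gmetad code: the ToyRRD class, its method names, and the value of `now` are assumptions made up for illustration.  Only the two timestamps are taken from the log above.

```python
class ToyRRD:
    """Toy model of a single RRD rule; this is not the rrdtool API."""

    def __init__(self, now, start=None):
        # With no explicit start time, the rrd's "last update" is set to
        # 10 seconds before the creation time (the rrdtool create default).
        self.last_update = start if start is not None else now - 10

    def update(self, timestamp):
        # An update is rejected unless its time is later than the last one.
        if timestamp <= self.last_update:
            raise ValueError(
                "illegal attempt to update using time %d when last "
                "update time is %d (minimum one second step)"
                % (timestamp, self.last_update)
            )
        self.last_update = timestamp


now = 1047668364           # assumed wall-clock time when the rrd is created
cluster_time = 1047668352  # cluster timestamp from the log, 12 s before `now`

rrd = ToyRRD(now)          # last_update becomes now - 10 = 1047668354
try:
    rrd.update(cluster_time)
except ValueError as err:
    print("RRD_update:", err)

# Hypothetical workaround: start the rrd just before the first timestamp.
fixed = ToyRRD(now, start=cluster_time - 1)
fixed.update(cluster_time)
print("update accepted, last update is", fixed.last_update)
```

If this model is right, passing an explicit start time at creation, somewhere before the first cluster timestamp, would avoid the error on an empty rrds directory.  I have written that as a possibility only; it is not something the current code does.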

~Jason

What is the cause of the gaps in the rrd graphs anyway?  Is it missing
data, maybe from what the heartbeat is set to when the rrds are created,
or does it come from trying to update it with two identical timestamps,
or maybe both?  What should the polling interval be when you want to
have multiple levels of gmetads running?

Is there any way to fix this "minimum one second step" problem once and
for all?  Like maybe save the last time the rrds were updated and skip
the current update if the timestamp hasn't changed?  Any other ideas?
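
The "save the last update time and skip" idea could look roughly like the sketch below.  gmetad itself is written in C, so this Python version only illustrates the logic; the UpdateGuard class and the should_update name are hypothetical and do not exist in the gmetad sources.

```python
class UpdateGuard:
    """Remembers the last timestamp written to each rrd file."""

    def __init__(self):
        self._last = {}  # rrd path -> last timestamp that was written

    def should_update(self, path, timestamp):
        # Skip the update when the timestamp has not advanced, since the
        # rrd library would reject it with "minimum one second step".
        last = self._last.get(path)
        if last is not None and timestamp <= last:
            return False
        self._last[path] = timestamp
        return True


guard = UpdateGuard()
for ts in (1047668352, 1047668352, 1047668367):
    if guard.should_update("load_one.rrd", ts):
        print("update with time", ts)
    else:
        print("skip duplicate time", ts)
```

Note that this only covers repeated timestamps from one poll to the next.  The table starts out empty, so it would not help with the first update after an rrd is created.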


On Fri, 2003-03-14 at 13:11, Steven Wagner wrote:
> Jason A. Smith wrote:
> > Which cause the now famous gaps in the rrd graphs when looking at the
> > hour resolution.  The 2.5.2 version of gmetad does not have a problem
> > getting data from gmetad with the grid tag removed, but 2.5.3 does so I
> > can only assume it must be related to the new timestamp patch.  Does
> > anyone have an idea what might be wrong?  I probably won't have time to
> > investigate this more till next week.
> 
> The timestamp patch goes off of the cluster's timestamp tag.  If a gmetad 
> polls a gmetad data source more often than the second gmetad polls its 
> gmond data source, it's going to get duplicate data.
> 
> It's worth pointing out (in my defense :P ) that documentation on the new 
> grid features of gmetad is, erm, rather light.  In my subjective reality, 
> all those changes magically appeared overnight.  So I'm not sure how it all 
> works. ;)
> 
> A couple possibilities:
> 
> *  Change the value being passed to the first RRD update function (whose 
> name escapes me at the moment) from the cluster localtime to time(0).
> *  Add some logic to test whether the new cluster localtime is equivalent 
> to the last one, and to pass time(0) if this is in fact the case.
> 
> Both pretty straightforward fixes.  I'd tend towards the second (more 
> complicated) option because the cluster time's supposed to be the most 
> accurate time figure in the XML.
> 
> But here's one more.  Let me put my devil's advocate hat on before I 
> continue ... there we go...
> 
> Is this the timestamp code's fault?  The cluster time value is supposed to 
> be the time the data source(s) for that cluster were last polled.  We 
> recognize a dead cluster from in front of gmetad by the fact that the 
> cluster time and host timestamp metrics do not update in the XML.  If we 
> operate under this assumption, then the recursive gmetads are behaving 
> correctly (if not of their own accord - it's the RRD library that's forcing 
> it!) by not updating the RRDs in this instance.  Adjusting the polling 
> intervals could fix this (and is probably the "most correct" although least 
> convenient way to do so) - as long as the original gmetad is polling its 
> gmond data source more often than other gmetads are polling it, this 
> condition won't occur.
> 
> 
-- 
/------------------------------------------------------------------\
|  Jason A. Smith                          Email:  [EMAIL PROTECTED] |
|  Atlas Computing Facility, Bldg. 510M    Phone:  (631)344-4226   |
|  Brookhaven National Lab, P.O. Box 5000  Fax:    (631)344-7616   |
|  Upton, NY 11973-5000                                            |
\------------------------------------------------------------------/
