Sorry about my first email; I didn't mean to imply that there was
something wrong with your patch, just that it may have exposed another
problem.  I have been thinking about this more, and I suspect this bug
was only exposed by my change to gmetad to remove the GRID tag from its
xml output, which forces the second gmetad to create its own copy of
the rrds.

I still think the authority URL in gmetad's GRID tag should be an
option, not mandatory, and if that changes in the next version then the
bug I am seeing now will have to be fixed.

How should multiple levels of gmetads be handled, with respect to
timestamps and updating the rrds, when you don't want to forward your
gmetad's authority URL?  In that case you would frequently get
timestamps more than 10 seconds in the past, so the rrd creation would
have to be fixed.  Maybe instead of the default --start of 10 seconds
before the current localtime, it should be 10 seconds before the
timestamp that came from the cluster report, since that is now the time
that is used when updating the rrds.

Some thought also needs to be given to the update interval: if you
query the other gmetad source before it has had a chance to update its
xml data, you could end up trying to update the rrds with an identical
copy of that data.  Or, if your interval is too long compared to the
source gmetad's, will the missed data cause gaps in the graphs?

Also, one other minor thing I mistakenly tripped over early on in my
testing.  Changing the update interval after the rrds have already been
created can cause problems like lots of rrd_update errors or gaps.  Is
there any way for gmetad to detect that the interval has changed since
the rrds were created and either print a warning or create new
databases?  It is kind of important, since once the rrds have been
created they expect to be continually updated at the same interval to
function properly, correct?
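
Even something crude at startup would help, for example (rough sketch
only; it shells out to "rrdtool info" and parses the "step = N" line
rather than using librrd directly, and the names are made up):

/* Warn at startup if an existing rrd was built with a different step
 * than the currently configured polling interval. */
#include <stdio.h>

static long
rrd_step_of(const char *rrd_path)
{
    char cmd[1024], line[256];
    long step = -1;
    FILE *p;

    snprintf(cmd, sizeof(cmd), "rrdtool info '%s'", rrd_path);
    if ((p = popen(cmd, "r")) == NULL)
        return -1;
    while (fgets(line, sizeof(line), p))
        if (sscanf(line, "step = %ld", &step) == 1)
            break;
    pclose(p);
    return step;
}

static void
warn_if_step_changed(const char *rrd_path, long configured_interval)
{
    long step = rrd_step_of(rrd_path);

    if (step > 0 && step != configured_interval)
        fprintf(stderr, "Warning: %s was created with a %ld second step "
                "but the polling interval is now %ld seconds; expect "
                "rrd_update errors or gaps.\n",
                rrd_path, step, configured_interval);
}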

~Jason


On Fri, 2003-03-14 at 14:29, Jason A. Smith wrote:
> I do agree with you about making sure that the second gmetad has a
> polling interval set to be the same or greater than the interval in the
> gmetad that it is getting data from.  What is a safe value to use here
> anyway?  This does help, but it does not eliminate the problem.  It
> looks like I still get frequent gaps, just not as many RRD_update
> (minimum one second step) errors.  I still get these errors sometimes,
> even with the second gmetad's interval set to 2-3 times what the first
> one is using.  Also, this does not solve the problem I get when first
> starting gmetad with an empty rrds directory.  It polls the gmetad data
> source for the first time, then tries to create the first rrd database,
> and I get the following errors:
> 
> $ /usr/sbin/gmetad -d99
> Going to run as user nobody
> Sources are ...
> Source: [BNL ATLAS] has 1 sources
>         130.199.207.251
> listening on port 8651
> Data thread 3076 is monitoring [BNL ATLAS] data source
>         130.199.207.251
> Created rrd /var/lib/ganglia/rrds/ATLAS Linux
> Cluster/acas040.usatlas.bnl.gov/load_one.rrd
> RRD_update: illegal attempt to update using time 1047668352 when last
> update time is 1047668354 (minimum one second step)
> Writing Summary data for source ATLAS Linux Cluster
> Created rrd /var/lib/ganglia/rrds/ATLAS Linux
> Cluster/__SummaryInfo__/load_one.rrd
> RRD_update: illegal attempt to update using time 1047668352 when last
> update time is 1047668354 (minimum one second step)
> data_thread() couldn't parse the XML and data to RRD for [BNL ATLAS]
> 
> I think I know what is causing this problem now, after looking at the
> gmetad sources and the rrd documentation.  With the timestamp patch,
> the rrd update may be using a timestamp that is several tens of seconds
> in the past, whereas before it was using the current time.  Also, when
> the rrds are created, no --start time is specified, so the default is
> only 10 seconds in the past.  I think the above errors happen when the
> cluster time is more than 10 seconds in the past, which will happen
> frequently when you try to have one gmetad poll another gmetad.
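> 
> (Reading the numbers in the errors above, and assuming rrdtool's
> default --start of "now - 10 seconds": the create must have happened
> at a local time of about 1047668364, which set the new rrd's
> last-update time to 1047668354; the first update then came in stamped
> with the cluster's time of 1047668352, about 12 seconds in the past,
> so rrdtool refused it.)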
> 
> ~Jason
> 
> What is the cause of the gaps in the rrd graphs anyway?  Is it missing
> data, maybe from what the heartbeat is set to when the rrds are created,
> or does it come from trying to update it with two identical timestamps,
> or maybe both?  What should the polling interval be when you want to
> have multiple levels of gmetads running?
> 
> Is there any way to fix this "minimum one second step" problem once and
> for all?  Like maybe save the last time the rrds were updated and skip
> the current update if the timestamp hasn't changed?  Any other ideas?
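> 
> Something like this is what I was imagining (rough sketch only; in the
> real code the last timestamp would have to be tracked per rrd/metric,
> not in a single static like this):
> 
> /* Skip an rrd update if the cluster report timestamp has not advanced
>  * since the last update we pushed for this rrd. */
> #include <time.h>
> 
> static time_t last_stamp = 0;   /* per-rrd in practice, not global */
> 
> static int
> should_update(time_t cluster_localtime)
> {
>     if (cluster_localtime <= last_stamp)
>         return 0;               /* identical/older data, skip it */
>     last_stamp = cluster_localtime;
>     return 1;
> }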
> 
> 
> On Fri, 2003-03-14 at 13:11, Steven Wagner wrote:
> > Jason A. Smith wrote:
> > > Which cause the now famous gaps in the rrd graphs when looking at the
> > > hour resolution.  The 2.5.2 version of gmetad does not have a problem
> > > getting data from gmetad with the grid tag removed, but 2.5.3 does, so I
> > > can only assume it must be related to the new timestamp patch.  Does
> > > anyone have an idea what might be wrong?  I probably won't have time to
> > > investigate this more till next week.
> > 
> > The timestamp patch goes off of the cluster's timestamp tag.  If a gmetad 
> > polls a gmetad data source more often than the second gmetad polls its 
> > gmond data source, it's going to get duplicate data.
> > 
> > It's worth pointing out (in my defense :P ) that documentation on the new 
> > grid features of gmetad is, erm, rather light.  In my subjective reality, 
> > all those changes magically appeared overnight.  So I'm not sure how it all 
> > works. ;)
> > 
> > A couple possibilities:
> > 
> > *  Change the value being passed to the first RRD update function (whose 
> > name escapes me at the moment) from the cluster localtime to time(0).
> > *  Add some logic to test whether the new cluster localtime is equivalent 
> > to the last one, and to pass time(0) if this is in fact the case.
> > 
> > Both pretty straightforward fixes.  I'd tend towards the second (more 
> > complicated) option because the cluster time's supposed to be the most 
> > accurate time figure in the XML.
> > 
> > But here's one more.  Let me put my devil's advocate hat on before I 
> > continue ... there we go...
> > 
> > Is this the timestamp code's fault?  The cluster time value is supposed to 
> > be the time the data source(s) for that cluster were last polled.  We 
> > recognize a dead cluster from in front of gmetad by the fact that the 
> > cluster time and host timestamp metrics do not update in the XML.  If we 
> > operate under this assumption, then the recursive gmetads are behaving 
> > correctly (if not of their own accord - it's the RRD library that's forcing 
> > it!) by not updating the RRDs in this instance.  Adjusting the polling 
> > intervals could fix this (and is probably the "most correct" although least 
> > convenient way to do so) - as long as the original gmetad is polling its 
> > gmond data source more often than other gmetads are polling it, this 
> > condition won't occur.
> > 
> > 
-- 
/------------------------------------------------------------------\
|  Jason A. Smith                          Email:  [EMAIL PROTECTED] |
|  Atlas Computing Facility, Bldg. 510M    Phone:  (631)344-4226   |
|  Brookhaven National Lab, P.O. Box 5000  Fax:    (631)344-7616   |
|  Upton, NY 11973-5000                                            |
\------------------------------------------------------------------/

