I agree that the second gmetad's polling interval should be the same as or greater than the interval of the gmetad it is getting data from. What is a safe value to use here anyway? This helps, but it does not eliminate the problem. I still get frequent gaps, just not as many RRD_update (minimum one second step) errors. I still get these errors sometimes, even with the second gmetad's interval set to 2-3 times what the first one is using.

Also, this does not solve the problem I get when first starting gmetad with an empty rrds directory. It polls the gmetad data source for the first time, then tries to create the first rrd database, and I get the following errors:

$ /usr/sbin/gmetad -d99
Going to run as user nobody
Sources are ...
Source: [BNL ATLAS] has 1 sources
        130.199.207.251
listening on port 8651
Data thread 3076 is monitoring [BNL ATLAS] data source
        130.199.207.251
Created rrd /var/lib/ganglia/rrds/ATLAS Linux Cluster/acas040.usatlas.bnl.gov/load_one.rrd
RRD_update: illegal attempt to update using time 1047668352 when last update time is 1047668354 (minimum one second step)
Writing Summary data for source ATLAS Linux Cluster
Created rrd /var/lib/ganglia/rrds/ATLAS Linux Cluster/__SummaryInfo__/load_one.rrd
RRD_update: illegal attempt to update using time 1047668352 when last update time is 1047668354 (minimum one second step)
data_thread() couldn't parse the XML and data to RRD for [BNL ATLAS]

I think I know what is causing this problem now, after looking at the gmetad sources and the rrd documentation. With the timestamp patch, the rrd update may be using a timestamp that could be several tens of seconds in the past, whereas before it was using the current time. Also, when the rrds are created, no --start time is specified, so the default is only 10 seconds in the past. I think the above errors happen when the cluster time is more than 10 seconds in the past, which will happen frequently when you try to have one gmetad poll another gmetad.

~Jason

What is the cause of the gaps in the rrd graphs anyway? Is it missing data, maybe from what the heartbeat is set to when the rrds are created, or does it come from trying to update it with two identical timestamps, or maybe both? What should the polling interval be when you want to have multiple levels of gmetads running? Is there any way to fix this "minimum one second step" problem once and for all? Like maybe save the last time the rrds were updated and skip the current update if the timestamp hasn't changed? Any other ideas?

On Fri, 2003-03-14 at 13:11, Steven Wagner wrote:
> Jason A. Smith wrote:
> > Which cause the now famous gaps in the rrd graphs when looking at the
> > hour resolution. The 2.5.2 version of gmetad does not have a problem
> > getting data from gmetad with the grid tag removed, but 2.5.3 does so I
> > can only assume it must be related to the new timestamp patch. Does
> > anyone have an idea what might be wrong? I probably won't have time to
> > investigate this more till next week.
>
> The timestamp patch goes off of the cluster's timestamp tag. If a gmetad
> polls a gmetad data source more often than the second gmetad polls its
> gmond data source, it's going to get duplicate data.
>
> It's worth pointing out (in my defense :P ) that documentation on the new
> grid features of gmetad is, erm, rather light. In my subjective reality,
> all those changes magically appeared overnight. So I'm not sure how it all
> works. ;)
>
> A couple possibilities:
>
> * Change the value being passed to the first RRD update function (whose
>   name escapes me at the moment) from the cluster localtime to time(0).
> * Add some logic to test whether the new cluster localtime is equivalent
>   to the last one, and to pass time(0) if this is in fact the case.
>
> Both pretty straightforward fixes. I'd tend towards the second (more
> complicated) option because the cluster time's supposed to be the most
> accurate time figure in the XML.
>
> But here's one more. Let me put my devil's advocate hat on before I
> continue ... there we go...
>
> Is this the timestamp code's fault? The cluster time value is supposed to
> be the time the data source(s) for that cluster were last polled. We
> recognize a dead cluster from in front of gmetad by the fact that the
> cluster time and host timestamp metrics do not update in the XML. If we
> operate under this assumption, then the recursive gmetads are behaving
> correctly (if not of their own accord - it's the RRD library that's forcing
> it!) by not updating the RRDs in this instance.
> Adjusting the polling intervals could fix this (and is probably the "most
> correct" although least convenient way to do so) - as long as the original
> gmetad is polling its gmond data source more often than other gmetads are
> polling it, this condition won't occur.
>
>

-- 
/------------------------------------------------------------------\
| Jason A. Smith                         Email: [EMAIL PROTECTED]  |
| Atlas Computing Facility, Bldg. 510M   Phone: (631)344-4226      |
| Brookhaven National Lab, P.O. Box 5000 Fax:   (631)344-7616      |
| Upton, NY 11973-5000                                             |
\------------------------------------------------------------------/
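
P.S. For what it's worth, here is a minimal Python sketch of two ideas from the message above: create each rrd with a start time just before the first sample instead of relying on the default of 10 seconds in the past, and remember the last update time so that an update whose cluster timestamp has not advanced is skipped rather than passed to RRD_update. FakeRRD, create_rrd and guarded_update are hypothetical names made up for illustration; they are not gmetad or rrdtool functions, and FakeRRD models only the rule that an update must be newer than the last update time. This is a sketch of the logic, not a patch.

```python
class FakeRRD:
    """Stand-in for one rrd file (hypothetical); models only the timestamp rule."""

    def __init__(self, start):
        # A freshly created rrd treats its start time as the last update time.
        self.last_update = start
        self.samples = []

    def update(self, timestamp, value):
        # Mirrors the RRD_update complaint: the new time must be strictly newer.
        if timestamp <= self.last_update:
            raise ValueError(
                f"illegal attempt to update using time {timestamp} "
                f"when last update time is {self.last_update} "
                "(minimum one second step)"
            )
        self.last_update = timestamp
        self.samples.append((timestamp, value))


def create_rrd(cluster_time, margin=1):
    """Create the rrd with a start time just before the first cluster timestamp.

    'margin' is an assumed value; anything >= 1 second satisfies the rule.
    """
    return FakeRRD(start=cluster_time - margin)


def guarded_update(rrd, cluster_time, value):
    """Skip the update (return False) if the cluster time has not advanced."""
    if cluster_time <= rrd.last_update:
        return False
    rrd.update(cluster_time, value)
    return True
```

As a usage example: assuming the rrd was created at time 1047668364 with the default start (an assumption, chosen because 10 seconds earlier is the 1047668354 shown in the log), FakeRRD(start=1047668364 - 10).update(1047668352, 0.5) raises the same kind of error as in the output above. With create_rrd(1047668352), the first guarded_update at 1047668352 is stored and a repeat with the same timestamp is skipped quietly.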