[been waiting to spring that one :) ]
So it looks like gappy graphs are back on my large cluster data source
(which is actually now three or four hosts instead of just the one). I
haven't checked gmetad lately, but does anyone know whether it round-robins
the query hosts, or whether it just builds a list and goes top-to-bottom on
each pass until it gets a response? Is there a timeout period involved?
Funny thing is, I haven't changed anything lately. Maybe this is
system-related...
The number of hosts on the grid has doubled (and the load average has grown
by at least that much) since the last time I saw this behavior (which looks
very similar - hosts and metadata from one data source are being updated
intermittently, the rest are fine).
Running gmetad with debug output, I see that it's bombing out of one of the
loops where it updates the RRD files. It's either updating a file twice in
one pass (the RRD update list for one loop iteration is about 4500 lines
long, so I can't be sure from an xterm :) ), or it's updating at least one
RRD file twice in one second. Yes, it's the "minimum one second step" bug!
The RRD library will not tolerate this, so it returns an error, which kills
the entire update loop (how graceful!).
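
To make the failure mode concrete, here is a rough Python sketch of a guard
that refuses to submit a second update for the same RRD file within the same
whole second. This is not gmetad code: the UpdateGuard class and the
"load_one.rrd" file name are made up for illustration, and it only models the
rule that each update's timestamp must be strictly later than the previous
one for that file.

```python
class UpdateGuard:
    """Hypothetical helper: remembers the last timestamp submitted per RRD file."""

    def __init__(self):
        self._last = {}  # path -> last whole-second timestamp accepted

    def should_update(self, path, timestamp):
        """Return True only if timestamp falls in a later second than the last one."""
        ts = int(timestamp)  # updates are compared at one-second resolution
        last = self._last.get(path)
        if last is not None and ts <= last:
            return False  # same second (or earlier): submitting this would be rejected
        self._last[path] = ts
        return True


guard = UpdateGuard()
print(guard.should_update("load_one.rrd", 1000))    # True
print(guard.should_update("load_one.rrd", 1000.7))  # False, still second 1000
print(guard.should_update("load_one.rrd", 1001))    # True
```

A guard like this would hide a double update rather than explain it, so it
is more of a diagnostic aid than a fix.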
A couple ideas:
* I'll check tomorrow and see if it's updating stuff twice. If so,
obviously it's a bug that needs squarshifyin', and the rest of this is
irrelevant.
* Why always update with N (== "Now")? Hmm, come to think of it, I
wonder if that might have been my idea. But ... isn't it more accurate to
actually use a host/cluster reporting timestamp, or a time() call made when
the XML is parsed, rather than leave it up to RRD?
* I find it annoying that the RRD library won't simply average the data
point it already has for "right this second" with a new update submitted
for that same second. After all, it *does* do a weighted average for
submissions that fall between data points to begin with...
* Falling out of the entire rrd update loop seems excessive. Failing on
just that RRD seems more logical (plus, RRDs are designed to be somewhat
resilient). Although this way the error is much more apparent. :)
* Pizza for dinner.
Anyway, I'm not sure about any of those ideas except the last one, which is
always a great idea.
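
For the timestamp and "fail on just that RRD" ideas above, a minimal Python
sketch of what I have in mind follows. Everything in it is an assumption for
illustration: update_all, make_fake_updater, and the file names are invented,
the updater is a stand-in for the real RRD call, and the error text is not
the library's actual message. The timestamp is taken once per pass (as if at
XML parse time) and a failure on one file is recorded instead of ending the
loop.

```python
import time


def update_all(pending, update_one, now=None):
    """Apply every pending update, collecting failures instead of aborting.

    pending    -- iterable of (rrd_path, value) pairs
    update_one -- callable(rrd_path, timestamp, value); stand-in for the RRD call
    now        -- timestamp taken once per pass, e.g. when the XML was parsed
    """
    ts = int(time.time()) if now is None else int(now)
    failures = []
    for path, value in pending:
        try:
            update_one(path, ts, value)
        except Exception as err:
            # Record the failure and carry on with the remaining files.
            failures.append((path, str(err)))
    return failures


def make_fake_updater():
    """Build a fake updater that rejects a repeat update in the same second."""
    last = {}

    def update_one(path, ts, value):
        if path in last and ts <= last[path]:
            raise ValueError("update for %s at %d is not after the last one" % (path, ts))
        last[path] = ts

    return update_one, last


updater, last_seen = make_fake_updater()
pending = [("a.rrd", 1.0), ("b.rrd", 2.0), ("a.rrd", 1.5), ("c.rrd", 3.0)]
failures = update_all(pending, updater, now=1000)
print(len(failures))        # 1
print(sorted(last_seen))    # ['a.rrd', 'b.rrd', 'c.rrd']
```

Here the duplicate entry for "a.rrd" fails on its own, and "c.rrd" still
gets updated. The trade-off is the one mentioned in the list: the error is
less visible unless the failures list is logged somewhere.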
I'm running a post-2.5.0 but pre-2.5.1 gmetad. I've checked the 2.5.1
source, though, and it appears to be basically the same as far as the
update functionality's concerned...