[been waiting to spring that one :) ]

So it looks like gappy graphs are back on my large cluster data source (which is actually now three or four hosts instead of just the one). I haven't checked gmetad lately but does anyone know if it round-robins the query hosts or is it just building a list and going top-to-bottom on each pass until it gets a response? Is there a timeout period involved?

Funny thing is, I haven't changed anything lately. Maybe this is system-related...

The number of hosts on the grid has doubled (and the load average has grown by at least that much) since the last time I saw this behavior (which looks very similar - hosts and metadata from one data source are being updated intermittently, the rest are fine).

Running gmetad with debug output, I see that it's bombing out of one of the loops where it updates the RRD files. It's either updating a file twice in one pass (the RRD update list for one loop iteration is about 4500 lines long, so I can't be sure from an xterm :) ), or it's updating at least one RRD file twice in one second. Yes, it's the "minimum one second step" bug!

The RRD library will not tolerate this, so it returns an error, which kills the entire update loop (how graceful!).

A couple ideas:

* I'll check tomorrow and see if it's updating stuff twice. If so, obviously it's a bug that needs squarshifyin', and the rest of this is irrelevant.
* Why always update with N (== "Now")? Hmm, come to think of it, I wonder if that might have been my idea. But ... isn't it more accurate to actually use a host/cluster reporting timestamp, or a time() call made when the XML is parsed, rather than leave it up to RRD?
* I find it annoying that the RRD library won't do a simple average of the existing data point for "right this second" and the update submission that you offer it at the same time. After all, it *does* do a weighted average for submissions between data points to begin with...
* Falling out of the entire RRD update loop seems excessive. Failing on just that RRD seems more logical (plus, RRDs are designed to be somewhat resilient). Although this way the error is much more apparent. :)
* Pizza for dinner.

Anyway, I'm not sure about any of those ideas except the last one, which is always a great idea.
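For what it's worth, here's a rough sketch of the second and fourth ideas (one explicit timestamp per pass, and failing on just the offending RRD). It's in Python rather than gmetad's C, and it doesn't touch the real RRD library at all: ToyRRD and update_all are names I made up for illustration, and the toy only imitates the "each update must be at least one second newer than the last" rule. Treat it as a sketch under those assumptions, not as how gmetad does or should do it.

```python
class ToyRRD:
    """Hypothetical stand-in for one RRD file (not the real rrdtool API)."""

    def __init__(self, name):
        self.name = name
        self.last_update = 0   # timestamp of the most recent accepted update
        self.values = []       # accepted (timestamp, value) pairs

    def update(self, timestamp, value):
        # Imitate the minimum-one-second-step rule: an update that is not
        # strictly newer than the previous one is rejected.
        if timestamp <= self.last_update:
            raise ValueError(
                f"{self.name}: illegal update time {timestamp} "
                f"(last update was {self.last_update})"
            )
        self.last_update = timestamp
        self.values.append((timestamp, value))


def update_all(rrds, samples, parse_time):
    """Apply every (name, value) sample using one explicit timestamp.

    parse_time is assumed to be taken once, when the XML is parsed,
    instead of letting each update default to "now".
    Returns the list of error messages; one bad RRD does not stop the pass.
    """
    failures = []
    for name, value in samples:
        try:
            rrds[name].update(parse_time, value)
        except ValueError as err:
            # Record the failure and carry on with the remaining RRDs.
            failures.append(str(err))
    return failures


rrds = {n: ToyRRD(n) for n in ("load_one", "cpu_user", "mem_free")}

# "load_one" shows up twice in the same pass, which triggers the error.
samples = [
    ("load_one", 0.5),
    ("load_one", 0.7),
    ("cpu_user", 12.0),
    ("mem_free", 2048.0),
]

failures = update_all(rrds, samples, parse_time=1000)
for message in failures:
    print("skipped:", message)
```

The duplicate still gets reported, so the error stays visible, but the metrics that come after it in the list are updated instead of being dropped along with it.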

I'm running a post-2.5.0 but pre-2.5.1 gmetad. I've checked the 2.5.1 source, though, and it appears to be basically the same as far as the update functionality's concerned...

