[Ganglia-developers] The Gap Into Madness

Steven Wagner Mon, 11 Nov 2002 18:23:30 -0800

[been waiting to spring that one :) ]

So it looks like gappy graphs are back on my large cluster data source (which is actually now three or four hosts instead of just the one). I haven't checked gmetad lately, but does anyone know if it round-robins the query hosts, or is it just building a list and going top-to-bottom on each pass until it gets a response? Is there a timeout period involved?

Funny thing is, I haven't changed anything lately. Maybe this is system-related...

The number of hosts on the grid has doubled (and the load average has grown by at least that much) since the last time I saw this behavior (which looks very similar - hosts and metadata from one data source are being updated intermittently, the rest are fine).

Running gmetad with debug output, I see that it's bombing out of one of the loops where it updates the RRD files. It's either updating a file twice in one pass (the RRD update list for one loop iteration is about 4500 lines long, so I can't be sure from an xterm :) ), or it's updating at least one RRD file twice in one second. Yes, it's the "minimum one second step" bug!
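Eyeballing 4500 lines in an xterm is not going to settle the "twice in one pass" question, so here is a rough sketch of a duplicate check. It assumes the RRD paths for a single pass have already been pulled out of the debug output, one path per line; the real debug format isn't shown here, so that extraction step is left out. The function name and the sample paths are made up for illustration.

```python
from collections import Counter


def find_repeated_updates(paths):
    """Return {path: count} for every RRD path seen more than once in one pass."""
    # Strip whitespace and ignore blank lines before counting.
    counts = Counter(p.strip() for p in paths if p.strip())
    return {path: n for path, n in counts.items() if n > 1}


# Hypothetical paths standing in for one pass of the update list.
sample_pass = [
    "cluster-a/host1/load_one.rrd",
    "cluster-a/host2/load_one.rrd",
    "cluster-a/host1/load_one.rrd\n",
]
repeated = find_repeated_updates(sample_pass)
for path, n in repeated.items():
    print(f"{path} updated {n} times in one pass")
```

An empty result would point away from the double-update theory and toward two separate passes landing in the same second.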

The RRD library will not tolerate this, so it returns an error, which kills the entire update loop (how graceful!).


A couple ideas:

* I'll check tomorrow and see if it's updating stuff twice. If so, obviously it's a bug that needs squarshifyin', and the rest of this is irrelevant.
* Why always update with N (== "Now") ? Hmm, come to think of it, I wonder if that might have been my idea. But ... isn't it more accurate to actually use a host/cluster reporting timestamp, or a time() call made when the XML is parsed, rather than leave it up to RRD?
* I find it annoying that the RRD library won't do a simple average of the existing data point for "right this second" and the update submission that you offer it at the same time. After all, it *does* do a weighted average for submissions between data points to begin with...
* Falling out of the entire rrd update loop seems excessive. Failing on just that RRD seems more logical (plus, RRDs are designed to be somewhat resilient). Although this way the error is much more apparent. :)

*  Pizza for dinner.

Anyway, I'm not sure about any of those ideas except the last one, which is always a great idea.

I'm running a post-2.5.0 but pre-2.5.1 gmetad. I've checked the 2.5.1 source, though, and it appears to be basically the same as far as the update functionality's concerned...
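To make two of the ideas above concrete (stamp each pass with an explicit time taken when the XML is parsed, and fail only the offending RRD instead of the whole loop), here is a small sketch. It is not gmetad code and it does not call the RRD library: ToyStore is a made-up stand-in that only imitates the rule that an update's timestamp must be later than the file's last update. StepError, run_pass and parsed_at are likewise hypothetical names.

```python
import time


class StepError(Exception):
    """Raised by the stand-in store when a timestamp does not advance."""


class ToyStore:
    """Toy stand-in for a set of RRD files: one last-update time per file."""

    def __init__(self):
        self.last_update = {}
        self.values = {}

    def update(self, path, timestamp, value):
        last = self.last_update.get(path)
        # Imitates the "minimum one second step" rule: time must move forward.
        if last is not None and timestamp <= last:
            raise StepError(f"{path}: time {timestamp} is not after last update {last}")
        self.last_update[path] = timestamp
        self.values[path] = value


def run_pass(store, updates, parsed_at=None):
    """Apply (path, value) updates with one shared timestamp; return the failures."""
    # One timestamp per pass, taken at parse time, instead of "N" per update.
    stamp = int(time.time()) if parsed_at is None else parsed_at
    failures = []
    for path, value in updates:
        try:
            store.update(path, stamp, value)
        except StepError as err:
            # Record the failure for this file only and keep going.
            failures.append((path, str(err)))
    return failures


store = ToyStore()
updates = [("a.rrd", 1.0), ("b.rrd", 2.0), ("a.rrd", 3.0), ("c.rrd", 4.0)]
failures = run_pass(store, updates, parsed_at=1000)
print(failures)
```

In this toy run the second a.rrd update is rejected and reported, but c.rrd still gets its value, which is the behavior argued for above. The trade-off mentioned in the list still applies: the failure is quieter, so the returned list would need to be logged somewhere visible.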
