Nicholas Henke wrote:
OK -- so check this link, it is all of our clusters:
http://www.liniac.upenn.edu/ganglia.

Notice how the overall graph is spotty, but none of the others are? How
do I fix that?

Nic

Hard to say conclusively without putting gmetad into debug mode and sifting through a couple hundred megs of debug output. If it makes you feel better, I'm seeing the same thing intermittently.

It seems to occur during load spikes on my front-end server (for those of you playing along at home: a Sun E420R with 2x450MHz UltraSPARC IIi, internal storage, and gobs of RAM starts to sweat around 26,000 metrics and 800 hosts - it takes 11 seconds to parse the XML on each page load!). Not enough actual data points get recorded during the final summarization stage of the RRD update process, so the RRD can't generate a consolidated data point, and hence you get nothin' on the graph.

In my case it seems to be one data collection thread that is having trouble. Either it's taking too long to parse the XML and update the RRDs, or it's encountering some kind of error condition. Last time I looked into gmetad's RRD update code, it was breaking out of THE ENTIRE UPDATE PROCESS upon encountering any error updating any one RRD.

At that stage, I was seeing a similar problem. It seemed to be tied to using a NOW() value instead of an absolute timestamp when updating data on an RRD that had just been updated.

Of course, you shouldn't be updating an RRD twice in the same second (rrdtool rejects any update whose timestamp isn't later than the previous one). That's obviously a bug of some kind. But the fact of the matter is, it was happening, and the data collection thread immediately gave up trying to update the rest of the RRDs and summary info.

This explanation doesn't really cover why the grid summaries aren't being updated in your case. Except ... they're either updated by *EVERY* data thread as it finishes its own cluster summary, or they're updated by a separate thread. I think they're updated by every thread. Since that happens right at the end of an update, it's possible that you ARE encountering the same thing I am, but with different data source threads trying to update the same RRD within one second.

I remember writing a little hack that put a 1-second retry around RRD_update(), which fixed a lot of gaps, but that was back before we figured out that RRD_update() is non-reentrant and what it really needed was a locking mechanism.

But I still maintain that gmetad should pass RRD_update() the timestamp at which the cluster was polled by the data source thread, instead of relying on NOW() ... after all, there can easily be 3 to 20 seconds of delay between NOW() and the time the data was actually received.

I'd try to do it myself but I'm trapped in design hell on another project. I just wanted to do a brain dump on this problem so that at least the info's floating around here on the list, and might potentially help someone having the same problem.

