Nicholas Henke wrote:
OK -- so check this link, it is all of our clusters:
http://www.liniac.upenn.edu/ganglia.
Notice how the overall graph is spotty, but none of the others are? How
do I fix that?
Nic
Hard to conclusively say without putting gmetad into debug mode and sifting
through a couple hundred megs of debug output. If it makes you feel
better, I'm seeing the same thing intermittently.
It seems to occur during load spikes on my front-end server (for those of
you playing along at home, Sun E420R with 2x450MHz UltraSPARC IIi, internal
storage and gobs of RAM starts to sweat around 26,000 metrics and 800 hosts
- takes 11 seconds to parse the XML each page load!). Not enough actual
data points are being recorded during the final consolidation stage of the
RRD update process for RRD to generate a consolidated data point, and
hence you get nothin' on the graph.
In my case it seems to be one data collection thread that is having
trouble. Either it's taking too long to parse the XML and update the RRDs,
or it's encountering some kind of error condition. Last time I looked into
gmetad's RRD update code, it was breaking out of THE ENTIRE UPDATE PROCESS
upon encountering any error updating any one RRD.
At that stage, I was seeing a similar problem. It seemed to be tied to
using a NOW() value instead of an absolute timestamp when updating data on
an RRD that had just been updated.
Of course, you shouldn't be updating an RRD twice in a second. That's
obviously a bug of some kind. But the fact of the matter is, it was
happening, and the data collection thread immediately gave up trying to
update the rest of the RRDs and summary info.
This explanation doesn't really cover why the grid summaries aren't being
updated in your case. Except ... they're either updated by *EVERY* data
thread as it finishes its own cluster summary, or they're updated by a
separate thread. I think they're updated by every thread. Since that
happens right at the end of an update, it's possible that you ARE
encountering the same thing I am, but with different data source threads
trying to update the same RRD within one second.
I remember writing a little hack that put in a 1-second retry around
RRD_update() that fixed a lot of gaps, but that was back before we figured
out that RRD_update() was non-reentrant and what it really needed was a
locking mechanism.
But I still maintain that gmetad should be passing the timestamp at which
the cluster was polled by the data source thread to RRD_update() instead of
relying on NOW() ... after all, it's entirely possible that there could be
between 3 and 20 seconds of delay between NOW() and the time the data was
actually received.
I'd try to do it myself but I'm trapped in design hell on another project.
I just wanted to do a brain dump on this problem so that at least the
info's floating around here on the list, and might help someone having the
same problem.