Nicholas Henke wrote:
OK -- so check this link, it is all of our clusters:
http://www.liniac.upenn.edu/ganglia.
Notice how the overall graph is spotty, but none of the others are? How
do I fix that?
Nic
Hard to conclusively say without putting gmetad into debug mode and sifting
through a couple hundred megs of debug output. If it makes you feel
better, I'm seeing the same thing intermittently.
It seems to occur during load spikes on my front-end server (for those of
you playing along at home, Sun E420R with 2x450MHz UltraSPARC IIi, internal
storage and gobs of RAM starts to sweat around 26,000 metrics and 800 hosts
- takes 11 seconds to parse the XML each page load!). Not enough actual
data points are being recorded during the final consolidation stage of the
RRD update process for RRD to generate a consolidated data point, and
hence you get nothin' on the graph.
In my case it seems to be one data collection thread that is having
trouble. Either it's taking too long to parse the XML and update the RRDs,
or it's encountering some kind of error condition. Last time I looked into
gmetad's RRD update code, it was breaking out of THE ENTIRE UPDATE PROCESS
upon encountering any error updating any one RRD.
At that stage, I was seeing a similar problem. It seemed to be tied to
using a NOW() value instead of an absolute timestamp when updating data on
an RRD that had just been updated.
Of course, you shouldn't be updating an RRD twice in a second. That's
obviously a bug of some kind. But the fact of the matter is, it was
happening, and the data collection thread immediately gave up trying to
update the rest of the RRDs and summary info.
This explanation doesn't really cover why the grid summaries aren't being
updated in your case. Except ... they're either updated by *EVERY* data
thread as it finishes its own cluster summary, or they're updated by a
separate thread. I think they're updated by every thread. Since that
happens right at the end of an update, it's possible that you ARE
encountering the same thing I am, but with different data source threads
trying to update the same RRD within one second.
I remember writing a little hack that put in a 1-second retry around
RRD_update() that fixed a lot of gaps, but that was back before we figured
out that RRD_update() was non-reentrant and what it really needed was a
locking mechanism.
But I still maintain that gmetad should be passing the timestamp at which
the cluster was polled by the data source thread to RRD_update() instead of
relying on NOW() ... after all, it's entirely possible that there could be
between 3 and 20 seconds of delay between NOW() and the time the data was
actually received.
I'd try to do it myself but I'm trapped in design hell on another project.
I just wanted to do a brain dump on this problem so that at least the
info's floating around here on the list, and might help someone having the
same problem.