I have two gmetad servers, set up using the scalability option so they
each keep their own copy of the RRDs.  The second gmetad, which gets
its data from the first one, appears to be keeping corrupted summary
statistics occasionally: every few minutes it thinks some of the nodes
are down.  The result is bad summary data being written to the RRDs,
which produces graphs like the example I have attached.
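
For reference, the relevant part of the second gmetad's gmetad.conf
looks something like this (the hostname here is just a placeholder;
8651 is the default gmetad xml_port):

  # second gmetad polls the first gmetad's xml_port instead of gmonds
  data_source "first-gmetad" first-gmetad.example.com:8651
  scalable on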

I have tried looking at the raw XML data from both gmetad servers to
see if I could find the cause, but I didn't see anything unusual.
During the short intervals when it thinks some of the nodes are down,
if I query the second gmetad for its filter=summary output, I get
results like this:

$ echo "/?filter=summary" | nc localhost 8652 | grep '<HOSTS'
<HOSTS UP="63" DOWN="29" SOURCE="gmetad"/>

Also, when this happens the top-level grid display on the web frontend
shows 29 hosts down, both for the whole grid and for the cluster
reported by that second gmetad.  If I look at that cluster view though,
none of the hosts are marked as down and the raw XML looks good.  It is
only the filter=summary XML and the RRD graphs that look bad.  From
this, I can only assume that gmetad must have corrupt summary
statistics internally, which it reports on the query port and writes
into the RRDs.  Is this some thread timing or data locking issue?  Does
anyone else see this?
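
For what it's worth, here is a quick check that compares the two views
back to back, using 8651, the default gmetad xml_port.  If I understand
the web frontend's heuristic correctly, a host only counts as down when
its TN exceeds 4*TMAX, and the sed pattern below assumes the TN and
TMAX attributes sit next to each other in the HOST tag, as they do in
my dumps:

$ echo "/?filter=summary" | nc localhost 8652 | grep '<HOSTS'
$ nc localhost 8651 | grep '<HOST ' \
>   | sed 's/.* TN="\([0-9]*\)" TMAX="\([0-9]*\)".*/\1 \2/' \
>   | awk '$1 > 4*$2 {n++} END {print n+0, "hosts with TN > 4*TMAX"}'

If the two disagree at the same instant, the problem would have to be
in gmetad's summary pass rather than in the data it collects.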

Ganglia version: 2.5.6

~Jason


-- 
/------------------------------------------------------------------\
|  Jason A. Smith                          Email:  [EMAIL PROTECTED] |
|  Atlas Computing Facility, Bldg. 510M    Phone:  (631)344-4226   |
|  Brookhaven National Lab, P.O. Box 5000  Fax:    (631)344-7616   |
|  Upton, NY 11973-5000                                            |
\------------------------------------------------------------------/

<<inline: ganglia-graph.gif>>
