I figured out what the problem was.  I looked at the gmetad source code
here: gmetad/process_xml.c, and remembered that I had one gmetad that
was older than 2.5.  For older versions of the xml data, gmetad uses a
different method for calculating the host_up flag:

abs(cluster_localtime - reported) < 60

while for xml data newer than ganglia 2.5 it uses:

tn < tmax * 4
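
To be concrete, the two checks amount to something like this (a rough
C sketch of the logic as I read process_xml.c, not the actual code,
and the variable names are mine):

#include <stdlib.h>  /* labs() */

/* Sketch of gmetad's host_up decision as I understand it from
 * process_xml.c.  Variable names are mine, not the real ones. */
static int host_is_up(long cluster_localtime, long reported,
                      long tn, long tmax, int old_xml)
{
    if (old_xml)
        /* pre-2.5 xml: up if the last report is within 60 seconds
         * of the cluster's localtime */
        return labs(cluster_localtime - reported) < 60;

    /* 2.5+ xml: up if the time since the last report (tn) is less
     * than four times the reporting interval (tmax) */
    return tn < tmax * 4;
}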

Because of how we have ganglia set up, the xml data goes through a
couple of levels of gmetads, and some hosts' data can easily be more
than 60 seconds old, which caused sporadic intervals where many hosts
were marked as down.

This does point out a few problems with gmetad and the webfrontend:

1. gmetad and the webfrontend currently have different methods for
determining when a host should be marked as down.  For old xml, gmetad
uses 60 seconds since the last reported metric, and for new xml it
uses 4 times tmax.  The webfrontend uses the same method for new xml
data, but for old xml data it uses
abs(cluster_localtime - reported) < 4*60 (see the host_alive function
in ganglia.php, and the sketch after this list).  This inconsistency,
plus the fact that we had an old gmetad, made the webfrontend view
confusing.  The summary data indicated that some nodes were down
because their xml data was a little older than 60 seconds, while the
cluster view showed no nodes down because the data, although older
than 60 seconds, wasn't older than 4 minutes.  These should be made
consistent.

2. 60 seconds seems a little low to me, especially if you are using
the scalability option with a few levels of gmetads, or a
longer-than-default update interval.
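
For comparison, here is the webfrontend's old-xml check (host_alive in
ganglia.php) written as the same kind of C sketch, so the two
thresholds sit side by side (again, the names are mine):

#include <stdlib.h>  /* labs() */

/* The webfrontend's host_alive test for old xml, as I read
 * ganglia.php, rewritten in C for comparison.  Note the 4*60
 * threshold versus gmetad's plain 60. */
static int host_alive_old_xml(long cluster_localtime, long reported)
{
    return labs(cluster_localtime - reported) < 4 * 60;
}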

A separate problem we are seeing is that when one of gmetad's data
source threads gets stuck (maybe when it can't contact any hosts in
its cluster), that cluster's data stops being updated.  When this
happens the webfrontend is very misleading.  The grid summary xml data
and rrd graphs decrease their totals because of that "down" cluster,
but the rest of the grid view on the webfrontend doesn't indicate
which cluster is down unless you look very closely at the time values
on the cluster graphs.

The right edge of the rrd graphs for all the working clusters looks
normal (the right edge is the current time), but for the down cluster
the right edge never advances; it stays at the last time that cluster
was updated.  Since the graphs are laid out in a table, you would
expect the horizontal range to be the same in every column, right?
Also, the cpu and host totals to the left of the graphs still show no
problem.  Is this a problem with the gmetad summary data, the
webfrontend, or both?  Maybe the total grid summary is showing
decreased numbers while the summary for the affected cluster still has
old data?  If the right edge of the affected cluster's graph showed
the current time, and the summary data for that cluster correctly
indicated it was stale, maybe by marking all of its hosts as down,
then we could more easily identify where the problem is.
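
Something like this is what I have in mind for the summary pass (just
a sketch; the struct, the field names and the threshold are all made
up for illustration):

#include <time.h>

/* If a data source hasn't been heard from for a while, report all of
 * its hosts as down so the problem shows up in the grid view instead
 * of being hidden behind stale numbers. */
struct data_source {
    time_t last_heard;    /* last time this source's thread got data */
    int    hosts_total;
    int    hosts_up;
    int    hosts_down;
};

static void summarize_source(struct data_source *ds, time_t now,
                             time_t stale_after)
{
    if (now - ds->last_heard > stale_after) {
        ds->hosts_up   = 0;
        ds->hosts_down = ds->hosts_total;
    }
}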

Sorry for the long email....
~Jason


On Wed, 2004-07-14 at 15:05, Jason A. Smith wrote:
> I have two gmetad servers, setup using the scalability option so they
> each keep their own copy of the rrds.  The second gmetad, which gets
> data from the first one, appears to be keeping corrupted summary
> statistics occasionally.  Every few minutes it thinks some of the nodes
> are down.  The result is bad summary data being written to the rrds,
> which causes graphs like the example I have attached.
> 
> I have tried looking at the raw xml data, from both gmetad servers, to
> see if I could find the cause, but I didn't see anything unusual. 
> During the small intervals that it thinks some of the nodes are down, if
> I query the second gmetad for its filter=summary output, I get results
> like this:
> 
> $ echo "/?filter=summary" | nc localhost 8652 | grep '<HOSTS'
> <HOSTS UP="63" DOWN="29" SOURCE="gmetad"/>
> 
> Also, when this happens the top-level grid display on the webfrontend
> shows 29 hosts down for the whole grid and for the cluster reported from
> that second gmetad.  If I look at that cluster view though, none of the
> hosts are marked as down and the raw xml looks good.  It is only the
> filter=summary xml that looks bad and the rrd graphs.  From this, I can
> only assume that gmetad must have corrupt summary statistics internally
> which it reports on the query port and writes into the rrds.  Is this
> some thread timing or data locking issue?  Does anyone else see this?
> 
> Ganglia version: 2.5.6
> 
> ~Jason
-- 
/------------------------------------------------------------------\
|  Jason A. Smith                          Email:  [EMAIL PROTECTED] |
|  Atlas Computing Facility, Bldg. 510M    Phone:  (631)344-4226   |
|  Brookhaven National Lab, P.O. Box 5000  Fax:    (631)344-7616   |
|  Upton, NY 11973-5000                                            |
\------------------------------------------------------------------/


