I figured out what the problem was. I looked at the gmetad source code in gmetad/process_xml.c, and remembered that I had one gmetad that was older than 2.5. For older versions of the xml data, gmetad uses a different method for calculating the host_up flag:

    abs(cluster_localtime - reported) < 60

while for xml data newer than ganglia 2.5 it uses:

    tn < tmax * 4

Because of how we have ganglia set up, the xml data goes through a couple of levels of gmetads, so a host's data can easily be older than 60 seconds, which causes sporadic intervals where many hosts were being marked as down.
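To make the two rules concrete, here is a minimal sketch in C of what the checks amount to; the function and parameter names are mine for illustration and are not the actual identifiers used in process_xml.c:

    #include <stdlib.h>   /* abs() */

    /* Sketch only: assumes the timestamps and TN/TMAX values have
       already been parsed out of the <CLUSTER> and <HOST> attributes. */

    /* pre-2.5 xml: a host counts as up if its metrics were reported
       within 60 seconds of the cluster's local time */
    static int host_up_old_xml(int cluster_localtime, int reported)
    {
        return abs(cluster_localtime - reported) < 60;
    }

    /* 2.5+ xml: a host counts as up if the time since its last
       heartbeat (TN) is less than four times its reporting
       interval (TMAX) */
    static int host_up_new_xml(unsigned int tn, unsigned int tmax)
    {
        return tn < tmax * 4;
    }

Note that only the second rule scales with the configured reporting interval; the first is a fixed 60-second cutoff, which is what bites us here.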
This does point out a few problems with gmetad and the webfrontend:

1. gmetad and the webfrontend currently have different methods for determining when a host should be marked as down. gmetad uses 60 seconds since the last reported metric for old xml and 4 times tmax for new xml, while the webfrontend uses the same method for new xml data but uses abs(cluster_localtime - reported) < 4*60 for old xml data (see the ganglia.php:host_alive function). This inconsistency, combined with the fact that we had an old gmetad, made the webfrontend view confusing. The summary data indicated that some nodes were down because their xml data was a little older than 60 seconds, while the cluster view in the webfrontend showed no nodes down because the data was older than 60 seconds but not older than 4 minutes. These should be made consistent.

2. 60 seconds seems a little low to me, especially if you are using the scalability option with a few gmetads, or a longer than default update interval.

A separate problem we are seeing is that when one of gmetad's data source threads gets stuck (maybe when it can't contact any hosts in its cluster), that cluster's data stops being updated. When this happens the webfrontend is very misleading. The grid summary xml data and rrd graphs decrease their totals because of that "down" cluster, but the rest of the grid view on the webfrontend doesn't indicate which cluster is down unless you look very closely at the time values on the cluster graphs. What happens is that the right edge of the rrd graphs for all the other working clusters appears normal (the right edge is the current time), but for the down cluster the right edge never advances; it stays at the last time that cluster was updated. Since this is displayed in a table format, you would expect the horizontal range to be the same in every column, right? Also, the cpu and host totals to the left of the graphs still show no problem.

Is this a problem with the gmetad summary data, the webfrontend, or both? Maybe the total grid summary is showing decreased numbers while the summary for the affected cluster still has old data? If the right edge on the graph of the affected cluster were the current time, and the summary data for that cluster correctly indicated that it was stale (maybe by marking all of its hosts as down), then we could more easily identify where the problem is.

Sorry for the long email....

~Jason

On Wed, 2004-07-14 at 15:05, Jason A. Smith wrote:
> I have two gmetad servers, setup using the scalability option so they
> each keep their own copy of the rrds. The second gmetad, which gets
> data from the first one, appears to be keeping corrupted summary
> statistics occasionally. Every few minutes it thinks some of the nodes
> are down. The result is bad summary data being written to the rrds,
> which causes graphs like the example I have attached.
>
> I have tried looking at the raw xml data, from both gmetad servers, to
> see if I could find the cause, but I didn't see anything unusual.
> During the small intervals that it thinks some of the nodes are down, if
> I query the second gmetad for its filter=summary output, I get results
> like this:
>
> $ echo "/?filter=summary" | nc localhost 8652 | grep '<HOSTS'
> <HOSTS UP="63" DOWN="29" SOURCE="gmetad"/>
>
> Also, when this happens the top-level grid display on the webfrontend
> shows 29 hosts down for the whole grid and for the cluster reported from
> that second gmetad. If I look at that cluster view though, none of the
> hosts are marked as down and the raw xml looks good. It is only the
> filter=summary xml that looks bad and the rrd graphs. From this, I can
> only assume that gmetad must have corrupt summary statistics internally
> which it reports on the query port and writes into the rrds. Is this
> some thread timing or data locking issue? Does anyone else see this?
>
> Ganglia version: 2.5.6
>
> ~Jason

-- 
/------------------------------------------------------------------\
| Jason A. Smith                          Email: [EMAIL PROTECTED]  |
| Atlas Computing Facility, Bldg. 510M    Phone: (631)344-4226      |
| Brookhaven National Lab, P.O. Box 5000  Fax: (631)344-7616        |
| Upton, NY 11973-5000                                              |
\------------------------------------------------------------------/