I figured out what the problem was. I looked at the gmetad source code
here: gmetad/process_xml.c, and remembered that I had one gmetad that
was older than 2.5. For older versions of the XML data, gmetad uses a
different method for calculating the host_up flag:
abs(cluster_localtime - reported)
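
To make the expression above concrete, here is a minimal sketch of how a check like that could decide whether a host is up. This is not the code from process_xml.c: the function name host_up_legacy and the max_gap parameter are hypothetical, and I am only assuming the absolute difference is compared against some cutoff in seconds.

```c
#include <stdlib.h>  /* labs */

/*
 * Hypothetical helper, not the actual gmetad function.
 * cluster_localtime: the cluster's local time taken from the XML, in seconds
 * reported:          the time the host last reported, in seconds
 * max_gap:           assumed cutoff in seconds (not taken from the source)
 * Returns 1 if the host counts as up, 0 if it counts as down.
 */
int host_up_legacy(long cluster_localtime, long reported, long max_gap)
{
    /* The host is treated as up only while the two timestamps stay close. */
    return labs(cluster_localtime - reported) < max_gap;
}
```

With a check of this shape, a host gets flagged as down whenever the two timestamps are further apart than the cutoff, in either direction, even if the host itself is still running.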
I have two gmetad servers, set up using the scalability option so they
each keep their own copy of the RRDs. The second gmetad, which gets
data from the first one, occasionally appears to keep corrupted summary
statistics. Every few minutes it thinks some of the nodes
are down. The result i