We have an administrative tool that checks that a node is configured correctly and resets several settings. One of the things it does is verify the network configuration and then restart the network. Ganglia's listening threads don't seem to like it when the interface they are listening on is brought down and back up again. Although ganglia continues to broadcast metrics, it no longer receives them. That normally isn't a problem unless the node happens to be the one that gmetad is polling. This is the error message that starts appearing in the system log after the network restart:

Apr 9 17:40:09 rcas6010 /usr/sbin/gmond[15746]: mcast_thread() error multicasting

This is actually quite bad, since gmond continues to run and report its stale XML data, so gmetad never even tries the other hosts in the data source. Is it possible for gmond to recover from these types of errors, perhaps by closing and recreating its listening socket when it gets them?

~Jason

--
/------------------------------------------------------------------\
| Jason A. Smith                          Email: [EMAIL PROTECTED] |
| Atlas Computing Facility, Bldg. 510M    Phone: (631)344-4226     |
| Brookhaven National Lab, P.O. Box 5000  Fax:   (631)344-7616     |
| Upton, NY 11973-5000                                             |
\------------------------------------------------------------------/
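
P.S. To make the suggestion concrete, here is a rough sketch of the kind of recovery I have in mind. It is written in Python for brevity and is not gmond's actual code. The names open_listener and receive_with_recovery are made up for illustration, and the group and port are the stock Ganglia defaults (239.2.11.71, port 8649), which may differ from a given site's configuration. It also assumes the failure shows up as an error on the receive call; if the socket just goes silent after the restart, a receive timeout would be needed as the trigger instead.

```python
import socket
import time

MCAST_GROUP = "239.2.11.71"  # stock Ganglia multicast channel (assumed)
MCAST_PORT = 8649            # stock Ganglia port (assumed)


def open_listener(group=MCAST_GROUP, port=MCAST_PORT):
    """Create a UDP socket bound to the port and joined to the multicast group."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    # struct ip_mreq: group address followed by the local interface (any).
    mreq = socket.inet_aton(group) + socket.inet_aton("0.0.0.0")
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    return sock


def reopen(opener, retry_delay, sleep):
    """Keep trying to open a listener until the interface is usable again."""
    while True:
        try:
            return opener()
        except OSError:
            sleep(retry_delay)


def receive_with_recovery(handle, opener=open_listener, max_packets=None,
                          retry_delay=1.0, sleep=time.sleep):
    """Pass each datagram to handle(); on a socket error, close and re-join."""
    sock = reopen(opener, retry_delay, sleep)
    received = 0
    try:
        while max_packets is None or received < max_packets:
            try:
                data, _addr = sock.recvfrom(65535)
            except OSError:
                # The old socket may have lost its group membership when the
                # interface went down, so discard it and build a fresh one.
                sock.close()
                sleep(retry_delay)
                sock = reopen(opener, retry_delay, sleep)
                continue
            handle(data)
            received += 1
    finally:
        sock.close()
    return received
```

Calling receive_with_recovery(print) would listen indefinitely and print raw packets. Because socket timeouts are also reported as OSError in Python 3, calling settimeout() on the socket inside the opener would make a long silence trigger the same re-join path.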