We have an administrative tool that checks that a node is configured
correctly and resets several things.  Among other things, it verifies
the network configuration and then restarts the network.  Ganglia's
listening threads don't seem to cope when the interface they are
listening on is brought down and back up again.  Although Ganglia
continues to broadcast metrics, it no longer receives them.  That
normally isn't a problem unless the node happens to be the one that
gmetad is polling.  This is the error message that starts appearing in
the system log after the network restart:

Apr  9 17:40:09 rcas6010 /usr/sbin/gmond[15746]: mcast_thread() error multicasting

This is actually quite bad, since gmond keeps running and reporting its
stale XML data, so gmetad never even tries the other hosts in the data
source.  Is it possible for gmond to recover from these types of errors,
perhaps by closing and recreating its listening socket when it gets
them?
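
To make the idea concrete, here is a rough sketch in Python of the kind
of recovery I have in mind.  It is not gmond's actual code (gmond is
written in C), and the function names are made up for illustration.  It
also assumes the failure shows up as an error on the receive call; if
the socket just goes silent instead, a receive timeout would be needed
to trigger the same reopen path.

```python
import socket
import time


def open_multicast_listener(group, port):
    """Create a UDP socket bound to `port` and joined to multicast `group`."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    # ip_mreq: the group address followed by the local interface (0.0.0.0 = any).
    mreq = socket.inet_aton(group) + socket.inet_aton("0.0.0.0")
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    return sock


def listen_with_recovery(open_listener, handle, retry_delay=1.0, max_packets=None):
    """Receive datagrams, closing and recreating the socket after any error.

    `open_listener` is a zero-argument callable returning a fresh socket;
    `handle` is called with each payload.  `max_packets` (hypothetical, for
    demos and tests) stops the loop after that many packets.
    """
    sock = None
    received = 0
    while max_packets is None or received < max_packets:
        try:
            if sock is None:
                # First pass, or the previous socket was discarded after an error.
                sock = open_listener()
            data, _addr = sock.recvfrom(65535)
            handle(data)
            received += 1
        except OSError:
            # Drop the broken socket and rejoin the group on the next pass.
            if sock is not None:
                sock.close()
                sock = None
            time.sleep(retry_delay)
    if sock is not None:
        sock.close()
    return received
```

It would be used along the lines of
listen_with_recovery(lambda: open_multicast_listener(group, port), handler),
with the group and port taken from the gmond configuration.  Because the
socket is rebuilt from scratch, the multicast group is joined again once
the interface is back up.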

~Jason


-- 
/------------------------------------------------------------------\
|  Jason A. Smith                          Email:  [EMAIL PROTECTED] |
|  Atlas Computing Facility, Bldg. 510M    Phone:  (631)344-4226   |
|  Brookhaven National Lab, P.O. Box 5000  Fax:    (631)344-7616   |
|  Upton, NY 11973-5000                                            |
\------------------------------------------------------------------/

