That isn't exactly true.  I suspect the problem lies only with multicast
sockets.  TCP will almost certainly be fine, apart from a short delay
while it retransmits any packets that were dropped during the brief
interval that the interface was down, just as it recovers from any other
network glitch.

All I am asking is whether gmond could recreate its listening multicast
socket when it detects that it is having problems receiving multicast
packets.  It does detect the problem, since error messages are being
written to the system log.  The TCP socket has no problems with this
network restart.  In fact, the real problem is that the TCP XML
listening socket keeps responding to requests while the multicast socket
has died during the restart, so stale data keeps being served.
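
The recovery asked for above could be sketched roughly as follows.  This
is Python rather than gmond's C, and every name in it
(open_mcast_listener, receive_with_recovery, the timeout and retry_delay
parameters) is hypothetical, not taken from gmond.  It assumes a receive
failure surfaces as an OSError.  If the socket instead just goes silent
after the interface restart, the receive timeout is what triggers the
rejoin.  One plausible cause, not confirmed anywhere in this thread, is
that the kernel drops the group membership when the interface goes down.

```python
import socket
import struct
import time


def open_mcast_listener(group, port, timeout=30.0):
    """Create a UDP socket bound to `port` and joined to multicast `group`.

    `timeout` is an assumption of this sketch: a receive that stays silent
    for that long raises socket.timeout (a subclass of OSError).
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    # ip_mreq: multicast group address + local interface (0.0.0.0 = any)
    mreq = struct.pack("4s4s", socket.inet_aton(group),
                       socket.inet_aton("0.0.0.0"))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    sock.settimeout(timeout)
    return sock


def receive_with_recovery(open_listener, handle, max_packets,
                          retry_delay=1.0, sleep=time.sleep):
    """Receive packets, closing and recreating the socket after any error.

    open_listener: zero-argument callable returning a fresh listening socket
    handle:        callable invoked with each received payload
    Returns the number of times an error forced a recovery.
    """
    sock = None
    received = 0
    recoveries = 0
    while received < max_packets:
        try:
            if sock is None:
                sock = open_listener()      # (re)create and rejoin the group
            data, _addr = sock.recvfrom(65535)
        except OSError:
            # Receive, timeout, or open failure: drop the socket and retry.
            if sock is not None:
                sock.close()
                sock = None
            recoveries += 1
            sleep(retry_delay)
            continue
        handle(data)
        received += 1
    if sock is not None:
        sock.close()
    return recoveries
```

To try it against a real cluster, pass
functools.partial(open_mcast_listener, "239.2.11.71", 8649) as
open_listener.  I believe those are the usual Ganglia defaults, so
adjust them to match your gmond configuration.  Taking the socket
factory as a parameter also lets the loop be exercised without a
network.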

~Jason


On Thu, 2003-04-10 at 10:40, Jim Rowan wrote:
> I can't answer the question about gmond -- but any application that has
> a socket open is going to have trouble with your management practice.
> Maybe it needs to be smarter and not do the reconfig unless it's needed?
> (And if it is needed, it would likely be more appropriate to do a reboot
> -- who knows how many other apps are broken at this point...)
> 
> 
> -----Original Message-----
> From: Jason A. Smith [mailto:[EMAIL PROTECTED] 
> Sent: Wednesday, April 09, 2003 5:08 PM
> To: Ganglia Developers
> Subject: [Ganglia-developers] Network problems cause ganglia multicast
> errors.
> 
> We have an administrative tool that does a number of things to make sure
> a node is configured correctly and resets several things.  One of the
> things it does is to make sure the network is configured correctly and
> then does a network restart.  Ganglia's listening threads don't seem to
> like it when the interface it is listening to is brought down and back
> up again.  Although ganglia continues to broadcast metrics, it no longer
> receives them, which normally isn't a problem unless that node happens
> to be the one that gmetad is polling.  This is the error message that
> starts appearing in the system log after the network restart:
> 
> Apr  9 17:40:09 rcas6010 /usr/sbin/gmond[15746]: mcast_thread() error
> multicasting 
> 
> This is actually quite bad since gmond continues to run and report its
> stale xml data, so gmetad never even tries the other hosts in the data
> source.  Is it possible for gmond to recover from these types of errors?
> Maybe by closing and recreating its listening socket when it gets these
> errors?
> 
> ~Jason
-- 
/------------------------------------------------------------------\
|  Jason A. Smith                          Email:  [EMAIL PROTECTED] |
|  Atlas Computing Facility, Bldg. 510M    Phone:  (631)344-4226   |
|  Brookhaven National Lab, P.O. Box 5000  Fax:    (631)344-7616   |
|  Upton, NY 11973-5000                                            |
\------------------------------------------------------------------/

