I have gmetad configured to poll the first 14 nodes in our clusters, but
this weekend the first node in one of our clusters had a kernel panic.
I believe the node was still up and listening on port 8649, so it was
probably accepting connection requests but not sending any data back.
This one problem node caused the gmetad thread polling that cluster to
get no data, so nothing was recorded for that cluster until we noticed
the problem and rebooted that first node on Monday morning.  The only
evidence on the node running gmetad was messages like this in the
system log, written at varying intervals ranging from 10 seconds to 30
minutes:

Apr  5 08:00:05 ganglia01 /usr/sbin/gmetad[2728]: poll() timeout 

* always the same thread pid.

More info in the error message would be useful, like which node timed
out and exactly how long it was waiting (it should have been only 10
seconds, but maybe something else went wrong).

I briefly looked at the code in gmetad/data_thread.c and it looks like
it is supposed to time out after only 10 seconds and mark the whole
data source as dead, then sleep for the step interval +/- 5 seconds and
try the data source all over again, starting from the beginning of the
host list.  If the first node that gmetad successfully connects to has
a problem sending the data, because of high load or something else,
then gmetad might never collect any data for that cluster, which is
what happened to us.
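
To make sure I am reading it right, here is my paraphrase of what the
data thread appears to do for each source.  This is an untested sketch,
not the real code; connect_to_first_host() and read_xml() are stand-ins
for the actual calls, and the source struct fields are made up:

  /* Paraphrase of how I read gmetad/data_thread.c -- not the real code.
   * The thread only ever talks to one host, and a single poll() timeout
   * writes off the whole data source until the next step. */
  for (;;) {
      int sock = connect_to_first_host(source);   /* stand-in helper */
      if (sock >= 0) {
          struct pollfd pfd = { .fd = sock, .events = POLLIN };
          if (poll(&pfd, 1, 10 * 1000) <= 0) {     /* 10 second timeout */
              err_msg("poll() timeout");
              source->dead = 1;                    /* whole source marked dead */
          } else {
              read_xml(sock, source);              /* stand-in helper */
          }
          close(sock);
      }
      sleep(source->step + (rand() % 11) - 5);     /* step interval +/- 5 */
  }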

I think a better algorithm would be to keep trying hosts in the data
source's list until either the end of the list is reached or the XML
data is successfully read from one host.  That host should then be
remembered and used on each poll until it has a problem, at which point
gmetad would move on to the next host; something like the rough sketch
below.  Any comments or other suggestions?
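
Here is roughly what I have in mind.  Again, just an untested sketch:
the struct fields and helpers like connect_to_host() are made-up names,
not the real ones from data_thread.c:

  /* Sketch of the failover idea: start with the last host that worked,
   * and only give up on the source after every host in its list fails. */
  int poll_data_source(struct data_source *source)
  {
      int n = source->num_hosts;
      for (int i = 0; i < n; i++) {
          int idx = (source->last_good + i) % n;   /* last good host first */
          int sock = connect_to_host(source, idx); /* stand-in helper */
          if (sock < 0)
              continue;

          struct pollfd pfd = { .fd = sock, .events = POLLIN };
          if (poll(&pfd, 1, 10 * 1000) > 0 &&      /* keep the 10 sec timeout */
              read_xml(sock, source) == 0) {
              source->last_good = idx;             /* remember the good host */
              close(sock);
              return 0;
          }
          err_msg("data source [%s]: no data from host %d, trying next one",
                  source->name, idx);
          close(sock);
      }
      return -1;   /* only now is the whole source considered dead */
  }

That way the dead first node would have only cost us one extra 10
second timeout per step instead of blacking out the whole cluster for
the weekend.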

~Jason


-- 
/------------------------------------------------------------------\
|  Jason A. Smith                          Email:  [EMAIL PROTECTED] |
|  Atlas Computing Facility, Bldg. 510M    Phone:  (631)344-4226   |
|  Brookhaven National Lab, P.O. Box 5000  Fax:    (631)344-7616   |
|  Upton, NY 11973-5000                                            |
\------------------------------------------------------------------/


