I have gmetad configured to poll the first 14 nodes in our clusters, but this weekend the first node in one of our clusters had a kernel panic. I believe the node was still up and listening on port 8649, so it was probably accepting connection requests but not sending any data back. Because of this one problem node, the gmetad thread polling that cluster never received any data, so nothing was recorded for that cluster until we noticed the problem and rebooted the node on Monday morning. The only sign of trouble on the node running gmetad was messages like this in the system log, written at varying intervals ranging from 10 seconds to 30 minutes:

Apr 5 08:00:05 ganglia01 /usr/sbin/gmetad[2728]: poll() timeout

The thread PID was always the same. More information in the error message would be useful, such as which node timed out and exactly how long gmetad had been waiting (it should have been only 10 seconds, but maybe something else went wrong).

I briefly looked at the code in gmetad/data_thread.c, and it looks like it is supposed to time out after only 10 seconds and mark the whole data source as dead, then sleep for the step interval (+/- 5 seconds) and try the data source all over again, starting from the beginning. If the first node that gmetad connects to successfully has a problem sending the data, because of high load or some other problem, then gmetad might never collect data for that cluster, which is what happened to us.

I think a better algorithm would be to keep trying hosts in the data source's list until either the end of the list is reached or the XML data has been read successfully from one host. That host should then be remembered and used each time until it has a problem.

Any comments or other suggestions?

~Jason

-- 
/------------------------------------------------------------------\
| Jason A. Smith                          Email: [EMAIL PROTECTED] |
| Atlas Computing Facility, Bldg. 510M    Phone: (631)344-4226     |
| Brookhaven National Lab, P.O. Box 5000  Fax:   (631)344-7616     |
| Upton, NY 11973-5000                                             |
\------------------------------------------------------------------/
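
P.S. To make the suggestion concrete, here is a rough sketch of the failover idea in Python. It is not gmetad's actual C code, and the names `DataSource`, `read_xml`, and `preferred` are made up for illustration. It assumes that a gmond host sends its XML and then closes the connection, and it wraps around the host list starting from the remembered host, which is one possible reading of "try every host once per poll". It also logs which host failed, which is the extra detail I was missing in the syslog message.

```python
import socket
from typing import Callable, List, Optional, Tuple

Host = Tuple[str, int]


def read_xml(host: Host, timeout: float = 10.0) -> bytes:
    """Connect to one host and read until it closes the connection.

    Raises OSError (socket.timeout is a subclass) on any failure,
    including a host that accepts the connection but sends nothing.
    """
    chunks = []
    with socket.create_connection(host, timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    xml = b"".join(chunks)
    if not xml:
        raise OSError("connection closed without sending data")
    return xml


class DataSource:
    """Hypothetical stand-in for one data source and its host list."""

    def __init__(self, hosts: List[Host],
                 fetch: Callable[[Host], bytes] = read_xml) -> None:
        self.hosts = list(hosts)
        self.fetch = fetch
        self.preferred = 0  # index of the last host that returned data

    def poll(self) -> Optional[bytes]:
        """Try each host once, starting with the remembered one.

        Returns the XML from the first host that answers, or None if
        every host failed (the data source is dead for this interval).
        """
        count = len(self.hosts)
        for offset in range(count):
            index = (self.preferred + offset) % count
            host = self.hosts[index]
            try:
                xml = self.fetch(host)
            except OSError as exc:
                # Name the failing host so the log is actionable.
                print(f"poll failed for {host[0]}:{host[1]}: {exc}")
                continue
            self.preferred = index  # remember the host that worked
            return xml
        return None
```

With a list such as `[("node01", 8649), ("node02", 8649)]` (placeholder host names), a hung node01 would cost one timeout on the first poll. After that the data source would stick with node02 until node02 itself has a problem.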