Hi! I see a similar situation in my client's environment, where several gmond daemons sometimes fail to deliver data. However, after restarting the gmonds, everything works fine again.
From my observations, that could be related to a Qualys Security Scanner that hammers the systems with UDP packets.

Kind regards
Alexander Karner

From: Peter Cogan <peter.co...@gmail.com>
To: ganglia-general@lists.sourceforge.net
Date: 01.04.2014 13:45
Subject: [Ganglia-general] Hosts appear to be down

Hi all,

I have recently installed ganglia on a small cluster with 4 servers (h101, h102, h103, h104) and am having an issue whereby the 3 slaves are reported as being down (even though they are up). In fact, I can make it work for a short while (see below on changing the directory owner), and then they appear as dead again.

gmond is running on all four machines and gmetad is running on the server (h101). The web interface is also working.

From what I can see, the slaves appear down from the master's view because TN is high:

    [root@h101 ~]# telnet h101 8649 | grep HOST | grep TN
    <HOST NAME="h102" IP="" REPORTED="1396176378" TN="174355" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1396175888">
    <HOST NAME="h103" IP="" REPORTED="1396176382" TN="174351" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1396179776">
    <HOST NAME="h104" IP="" REPORTED="1396176379" TN="174355" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1396176191">
    <HOST NAME="h101" IP="" REPORTED="1396350726" TN="8" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1396176013">

However, if I perform the same command from any of the slaves, they see their own TN low and the others high, e.g.:

    [root@h101 ~]# telnet h102 8649 | grep HOST | grep TN
    <HOST NAME="h102" IP="hidden" REPORTED="1396350629" TN="2" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1396284414">
    <HOST NAME="h103" IP="hidden" REPORTED="1396284601" TN="66030" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1396181187">
    <HOST NAME="h104" IP="hidden" REPORTED="1396284597" TN="66034" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1396177590">
    <HOST NAME="h101" IP="hidden" REPORTED="1396284599" TN="66032" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1396176013">

I have tried restarting gmond on all machines and gmetad on the server, but it doesn't help. I went through the FAQs; here are the results:

For gmond:

- See if the gmond service is running, issue the ps aux|grep gmond command.
  Result: Confirmed.

- Stop the gmond service and run it by hand with debug mode. /etc/init.d/gmond stop; /usr/sbin/gmond -d 2. Look for errors near the top.
  Result: No errors.

- Attempt to retrieve the XML data by netcatting to the gmond daemon. nc <hostname> 8649
  Result: Works for all hosts.

- Confirm that UDP connections can be established between the gmetad and gmond (or gmond and other gmonds for multicast purposes) by running nc -u -l 8653 on the host in question, then echo "hello"|nc -u <hostname> 8653 from the gmetad or another gmond.
  Result: This works, but only for the first echo. If I try to send another message I get 'connection refused'. I have to stop and restart nc -u -l for it to receive another message. Not sure if this is expected behaviour.

- Check gmond data using /usr/bin/gstat -a
  Result: Each machine only sees itself.

For gmetad:

- See if the gmetad service is running, issue the ps aux|grep gmetad command.
  Result: Confirmed.

- Check syslog for errors. tail /var/log/messages
  Result: No errors.

- Stop the gmetad service and run it by hand with debug mode. /etc/init.d/gmetad stop; /usr/sbin/gmetad -d 2. Look for errors near the top.
  Result: It starts with no errors, but I don't see data from the other hosts coming in.

- Ensure that /var/lib/ganglia and its children are owned and writable by the nobody user (ganglia user on Debian/Ubuntu).
  Result: I'm on RHEL and the user was set to ganglia. I changed it to nobody and restarted all daemons, but then got "There was an error collecting ganglia data (127.0.0.1:8652): fsockopen error: Connection refused" on the web interface. I changed it back to owner ganglia and restarted, and suddenly the web page had data from all clusters, but only for a short while. I monitored using telnet as above: the TNs were reset to low numbers for a short while before increasing again, and the hosts appeared dead again.

- Retrieve the XML data by netcatting to the gmetad daemon. nc <hostname> 8650. This information is useful for submitting bug reports.
  Result: This returns with no output.

thanks
Peter
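For anyone who wants to watch the TN values without grepping telnet output by hand, here is a minimal Python sketch that parses the XML gmond serves on port 8649 and lists hosts whose TN has grown far beyond TMAX. The function names are made up for illustration, and the cutoff of 4 x TMAX is an assumption about how the web frontend decides a host is down, so adjust `factor` if your version behaves differently. The sample data is a trimmed excerpt of the output quoted above, wrapped in hypothetical GANGLIA_XML/CLUSTER elements.

```python
import socket
import xml.etree.ElementTree as ET


def fetch_gmond_xml(host, port=8649, timeout=5.0):
    """Read the XML dump gmond sends to a TCP client until it closes the connection."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


def stale_hosts(xml_data, factor=4):
    """Return (name, tn, tmax) for each HOST whose TN exceeds factor * TMAX.

    factor=4 is an assumption, not something confirmed in this thread.
    """
    root = ET.fromstring(xml_data)
    stale = []
    for host in root.iter("HOST"):
        tn = int(host.get("TN", "0"))
        tmax = int(host.get("TMAX", "0"))
        if tmax > 0 and tn > factor * tmax:
            stale.append((host.get("NAME"), tn, tmax))
    return stale


# Trimmed sample built from the values in the post (wrapper elements are hypothetical).
SAMPLE = b"""<GANGLIA_XML><CLUSTER NAME="example">
<HOST NAME="h102" TN="174355" TMAX="20" DMAX="0"/>
<HOST NAME="h101" TN="8" TMAX="20" DMAX="0"/>
</CLUSTER></GANGLIA_XML>"""

if __name__ == "__main__":
    for name, tn, tmax in stale_hosts(SAMPLE):
        print(f"{name}: TN={tn} exceeds 4 x TMAX ({tmax})")
```

Against a live daemon, something like `stale_hosts(fetch_gmond_xml("h101"))` should give the same view as the telnet command, assuming the daemon answers on the default port.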
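On the nc -u test that only accepts the first echo: some netcat builds appear to connect the listening UDP socket to the first sender, so a second nc invocation (which uses a new source port) is refused. Whether that applies to the nc build used here is an assumption. A sketch of a listener that stays unconnected and therefore accepts datagrams from any sender could look like this; the function names are hypothetical, and port 8653 from the FAQ step would be passed in by the caller.

```python
import socket


def open_listener(port, host="0.0.0.0"):
    """Bind an unconnected UDP socket; it accepts datagrams from any sender."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def receive(sock, count, timeout=10.0):
    """Wait for `count` datagrams and return a list of (sender_address, payload)."""
    sock.settimeout(timeout)
    received = []
    for _ in range(count):
        payload, sender = sock.recvfrom(65535)
        received.append((sender, payload))
    return received


def send(host, port, payload):
    """Send one datagram from a fresh socket, like a separate nc -u invocation."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
```

On the host under test one would run `receive(open_listener(8653), 2)` and call `send("<hostname>", 8653, b"hello")` twice from the gmetad host. If both datagrams arrive, the refusal seen with nc is more likely a property of nc than of the network.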
------------------------------------------------------------------------------
_______________________________________________
Ganglia-general mailing list
Ganglia-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-general