Hi, I am continuing to dig into the problem.
It seems that if I restart all the gmond daemons, data collection starts again, but only for a few minutes; then the hosts all stop reporting at pretty much the same time. The gmond daemons are still running, but it seems the data is either not being collected by gmetad or not being sent to gmetad.

What would cause this to happen after a few minutes of running fine?

thanks
Peter

On Tue, Apr 1, 2014 at 1:48 PM, Alexander Karner <a...@de.ibm.com> wrote:

> Hi!
>
> I see a similar situation in my client's environment, where various gmond
> sometimes fail to deliver data.
> However, after restarting the gmonds everything works fine again.
>
> From my observations that could be related to a Qualys Security Scanner
> that hammers the systems with UDP packets.
>
> Kind regards
>
> *Alexander Karner*
>
> From: Peter Cogan <peter.co...@gmail.com>
> To: ganglia-general@lists.sourceforge.net,
> Date: 01.04.2014 13:45
> Subject: [Ganglia-general] Hosts appear to be down
> ------------------------------
>
> Hi all,
>
> I have recently installed ganglia on a small cluster with 4 servers (h101,
> h102, h103, h104) and am having an issue whereby the 3 slaves are reported
> as being down (even though they are up). In fact, I can make it work for a
> short while (see below on changing the directory owner) and then they
> appear as dead.
>
> gmond is running on all four machines and gmetad is running on the server
> (h101). The web interface is also working.
>
> From what I can see, the slaves appear down from the master's view because
> TN is high:
>
> [root@h101 ~]# telnet h101 8649 | grep HOST | grep TN
> <HOST NAME="h102" IP="" REPORTED="1396176378" TN="174355" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1396175888">
> <HOST NAME="h103" IP="" REPORTED="1396176382" TN="174351" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1396179776">
> <HOST NAME="h104" IP="" REPORTED="1396176379" TN="174355" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1396176191">
> <HOST NAME="h101" IP="" REPORTED="1396350726" TN="8" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1396176013">
>
> However, if I perform the same command against any of the slaves, they see
> their own TN low and the others high, e.g.:
>
> [root@h101 ~]# telnet h102 8649 | grep HOST | grep TN
> <HOST NAME="h102" IP="hidden" REPORTED="1396350629" TN="2" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1396284414">
> <HOST NAME="h103" IP="hidden" REPORTED="1396284601" TN="66030" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1396181187">
> <HOST NAME="h104" IP="hidden" REPORTED="1396284597" TN="66034" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1396177590">
> <HOST NAME="h101" IP="hidden" REPORTED="1396284599" TN="66032" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1396176013">
>
> I have tried restarting gmond on all machines and gmetad on the server but
> it doesn't help.
> I went through the FAQs - here are the results:
>
> - For gmond:
>    - See if the gmond service is running, issue the *ps aux|grep gmond*
>      command.
>      Result: Confirmed
>    - Stop the gmond service and run it by hand with debug mode.
>      */etc/init.d/gmond stop; /usr/sbin/gmond -d 2*. Look for errors near
>      the top.
>      Result: No errors
>    - Attempt to retrieve the XML data by netcatting to the gmond daemon.
>      *nc <hostname> 8649*
>      Result: Works for all hosts
>    - Confirm that UDP connections can be established between the gmetad
>      and gmond (or gmond and other gmonds for multicast purposes) by
>      running *nc -u -l 8653* on the host in question, then
>      *echo "hello"|nc -u <hostname> 8653* from the gmetad or another gmond.
>      Result: This works, but only for the first echo. If I try to send
>      another message I get 'connection refused'. I have to stop and
>      restart nc -u -l for it to receive another message. Not sure if this
>      is expected behaviour.
>    - Check gmond data using /usr/bin/gstat -a
>      Result: Each machine only sees itself
>
> - For gmetad:
>    - See if the gmetad service is running, issue the *ps aux|grep gmetad*
>      command.
>      Result: Confirmed
>    - Check syslog for errors. *tail /var/log/messages*
>      Result: No errors
>    - Stop the gmetad service and run it by hand with debug mode.
>      */etc/init.d/gmetad stop; /usr/sbin/gmetad -d 2*. Look for errors near
>      the top.
>      Result: It starts with no errors, but I don't see data from the other
>      hosts coming in.
>    - Ensure that */var/lib/ganglia* and its children are owned and
>      writable by the *nobody* user (*ganglia* user on Debian/Ubuntu).
>      Result: I'm on RHEL and the user was set to ganglia. I changed it to
>      nobody and restarted all daemons, but the web interface then showed
>      "There was an error collecting ganglia data
>      (*127.0.0.1:8652* <http://127.0.0.1:8652/>): fsockopen error:
>      Connection refused". I changed it back to owner ganglia and
>      restarted, and suddenly the web page had data from all clusters, but
>      only for a short while. I monitored using telnet as above: the TNs
>      were being reset to low numbers for a short while, before increasing
>      again, and the hosts appeared dead again.
>    - Retrieve the XML data by netcatting to the gmetad daemon.
>      *nc <hostname> 8650*. This information is useful for submitting bug
>      reports.
>      Result: This returns with no output
>
> thanks
> Peter
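For anyone who wants to repeat the TN check from the telnet output above without grepping by hand, here is a minimal Python sketch. It parses the HOST elements and lists those whose TN is well past TMAX. Several things here are assumptions and not taken from the thread: the function names `fetch_gmond_xml` and `stale_hosts` are made up for illustration, the `factor=4` cutoff is a guess at how a front end might decide a host is down, and the `<CLUSTER>` wrapper in `SAMPLE` is only there so the pasted HOST lines form a well-formed document.

```python
import socket
import xml.etree.ElementTree as ET


def fetch_gmond_xml(host, port=8649, timeout=5.0):
    """Read everything gmond writes on its TCP port until it closes the socket.

    Hypothetical helper: equivalent in spirit to `nc <hostname> 8649`.
    """
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


def stale_hosts(xml_data, factor=4):
    """Return (name, tn, tmax) for every HOST whose TN exceeds factor * TMAX.

    The default factor of 4 is an assumption, not a documented constant.
    """
    root = ET.fromstring(xml_data)
    stale = []
    for host in root.iter("HOST"):
        tn = int(host.get("TN", "0"))
        tmax = int(host.get("TMAX", "0"))
        if tmax > 0 and tn > factor * tmax:
            stale.append((host.get("NAME"), tn, tmax))
    return stale


# HOST values copied from the h101 output above; the CLUSTER wrapper is hypothetical.
SAMPLE = """<CLUSTER NAME="sample">
<HOST NAME="h102" IP="" REPORTED="1396176378" TN="174355" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1396175888"/>
<HOST NAME="h103" IP="" REPORTED="1396176382" TN="174351" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1396179776"/>
<HOST NAME="h104" IP="" REPORTED="1396176379" TN="174355" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1396176191"/>
<HOST NAME="h101" IP="" REPORTED="1396350726" TN="8" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1396176013"/>
</CLUSTER>"""

if __name__ == "__main__":
    for name, tn, tmax in stale_hosts(SAMPLE):
        print(f"{name}: TN={tn} is far beyond TMAX={tmax}")
```

Against a live daemon, something like `stale_hosts(fetch_gmond_xml("h101"))` should give the same view. Running it against each node in turn would show whether every gmond has stopped hearing from its peers or only the one gmetad polls.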
------------------------------------------------------------------------------
_______________________________________________
Ganglia-general mailing list
Ganglia-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-general