Hi!

I see a similar situation in my client's environment, where various gmond
daemons sometimes fail to deliver data.
However, after restarting the gmonds everything works fine again.

From my observations, that could be related to a Qualys Security Scanner
that hammers the systems with UDP packets.



Kind regards

Alexander Karner





From:   Peter Cogan <peter.co...@gmail.com>
To:     ganglia-general@lists.sourceforge.net
Date:   01.04.2014 13:45
Subject:        [Ganglia-general] Hosts appear to be down



Hi all, 

I have recently installed ganglia on a small cluster with 4 servers (h101, 
h102, h103, h104) and am having an issue whereby the 3 slaves are reported 
as being down (even though they are up). In fact, I can make it work for a 
short while (see below on changing the directory owner) and then they 
appear as dead.

gmond is running on all four machines and gmetad is running on the server
(h101). The web interface is also working.

From what I can see, the slaves appear down from the master's view because
TN is high:

[root@h101 ~]# telnet h101 8649 | grep HOST | grep TN
<HOST NAME="h102" IP="" REPORTED="1396176378" TN="174355" TMAX="20" 
DMAX="0" LOCATION="unspecified" GMOND_STARTED="1396175888">
<HOST NAME="h103" IP="" REPORTED="1396176382" TN="174351" TMAX="20" 
DMAX="0" LOCATION="unspecified" GMOND_STARTED="1396179776">
<HOST NAME="h104" IP="" REPORTED="1396176379" TN="174355" TMAX="20" 
DMAX="0" LOCATION="unspecified" GMOND_STARTED="1396176191">
<HOST NAME="h101" IP="" REPORTED="1396350726" TN="8" TMAX="20" DMAX="0" 
LOCATION="unspecified" GMOND_STARTED="1396176013">

However, if I perform the same command from any of the slaves, they see
their own TN low and the others high, e.g.:
[root@h101 ~]# telnet h102 8649 | grep HOST | grep TN
<HOST NAME="h102" IP="hidden" REPORTED="1396350629" TN="2" TMAX="20" 
DMAX="0" LOCATION="unspecified" GMOND_STARTED="1396284414">
<HOST NAME="h103" IP="hidden" REPORTED="1396284601" TN="66030" TMAX="20" 
DMAX="0" LOCATION="unspecified" GMOND_STARTED="1396181187">
<HOST NAME="h104" IP="hidden" REPORTED="1396284597" TN="66034" TMAX="20" 
DMAX="0" LOCATION="unspecified" GMOND_STARTED="1396177590">
<HOST NAME="h101" IP="hidden" REPORTED="1396284599" TN="66032" TMAX="20" 
DMAX="0" LOCATION="unspecified" GMOND_STARTED="1396176013">
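
The comparison made by eye in the two dumps above can be automated. Below is a minimal sketch, not Ganglia's own code, that parses a gmond XML dump and lists the hosts whose TN is far above TMAX. The factor of 4 is an assumption based on how the web frontend appears to decide that a host is down, so it is kept as a parameter. The function name stale_hosts, the cluster name "example" and the trimmed sample are made up for illustration; only the HOST attributes are taken from the output above.

```python
import xml.etree.ElementTree as ET

# Assumption: a host counts as stale once TN exceeds this multiple of TMAX.
STALE_FACTOR = 4


def stale_hosts(xml_text, factor=STALE_FACTOR):
    """Return {host name: TN} for every HOST whose TN exceeds factor * TMAX."""
    root = ET.fromstring(xml_text)
    stale = {}
    for host in root.iter("HOST"):
        tn = int(host.get("TN", "0"))
        tmax = int(host.get("TMAX", "0"))
        if tmax > 0 and tn > factor * tmax:
            stale[host.get("NAME")] = tn
    return stale


# Trimmed sample built from the h101 output above (metrics omitted,
# cluster name is a placeholder).
SAMPLE = """<GANGLIA_XML><CLUSTER NAME="example">
<HOST NAME="h102" IP="" REPORTED="1396176378" TN="174355" TMAX="20" DMAX="0"/>
<HOST NAME="h101" IP="" REPORTED="1396350726" TN="8" TMAX="20" DMAX="0"/>
</CLUSTER></GANGLIA_XML>"""

if __name__ == "__main__":
    print(stale_hosts(SAMPLE))
```

Running this against the dump from each node in turn would show which peers each gmond has stopped hearing from.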

I have tried restarting gmond on all machines and gmetad on the server but 
it doesn't help.
I went through the FAQs. Here are the results:

For gmond:

See if the gmond service is running, issue the ps aux|grep gmond command.
Result: Confirmed.

Stop the gmond service and run it by hand with debug mode.
/etc/init.d/gmond stop; /usr/sbin/gmond -d 2. Look for errors near the top.
Result: No errors.

Attempt to retrieve the XML data by netcatting to the gmond daemon.
nc <hostname> 8649
Result: Works for all hosts.

Confirm that UDP connections can be established between the gmetad and
gmond (or gmond and other gmonds for multicast purposes) by running
nc -u -l 8653 on the host in question, then echo "hello"|nc -u <hostname> 8653
from the gmetad or another gmond.
Result: This works, but only for the first echo. If I try to send another
message I get 'connection refused'. I have to stop and restart nc -u -l for
it to receive another message. Not sure if this is expected behaviour.

Check gmond data using /usr/bin/gstat -a
Result: Each machine only sees itself.
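
On the UDP test in the checklist above: as far as I know, many netcat builds connect the listening UDP socket to the first sender, so a second nc -u run, which comes from a new source port, is refused until the listener is restarted. If that applies to the nc build in use (an assumption, not something verified here), the 'connection refused' is expected and says nothing about the network. A listener that never connects to a peer avoids the ambiguity. The sketch below uses only the Python standard library; the helper names receive_datagrams and send_datagram are made up for illustration.

```python
import socket


def receive_datagrams(sock, count, timeout=2.0):
    """Collect `count` datagrams from an unconnected UDP socket."""
    sock.settimeout(timeout)
    received = []
    for _ in range(count):
        # recvfrom accepts datagrams from any sender; the socket is
        # never connected to a single peer.
        data, sender = sock.recvfrom(4096)
        received.append((data, sender))
    return received


def send_datagram(host, port, payload):
    """Send one datagram from a fresh socket, like a new `nc -u` run."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.sendto(payload, (host, port))


if __name__ == "__main__":
    listener = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    port = listener.getsockname()[1]
    send_datagram("127.0.0.1", port, b"hello 1")
    send_datagram("127.0.0.1", port, b"hello 2")
    for data, sender in receive_datagrams(listener, 2):
        print(sender, data)
    listener.close()
```

To test between two machines, bind the listener to the host's own address and a free port instead of 127.0.0.1, then call send_datagram from the other machine more than once.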



For gmetad:

See if the gmetad service is running, issue the ps aux|grep gmetad command.
Result: Confirmed.

Check syslog for errors. tail /var/log/messages
Result: No errors.

Stop the gmetad service and run it by hand with debug mode.
/etc/init.d/gmetad stop; /usr/sbin/gmetad -d 2. Look for errors near the top.
Result: It starts with no errors, but I don't see data from the other hosts
coming in.

Ensure that /var/lib/ganglia and its children are owned and writable by
the nobody user (ganglia user on Debian/Ubuntu).
Result: I'm on RHEL and the user was set to ganglia. I changed it to nobody
and restarted all daemons, but then got "There was an error collecting
ganglia data (127.0.0.1:8652): fsockopen error: Connection refused" on the
web interface. I changed it back to owner ganglia and restarted, and suddenly
the web page had data from all clusters, but only for a short while. I
monitored using telnet as above: the TNs were reset to low numbers for a
short while before increasing again, and the hosts appeared dead again.

Retrieve the XML data by netcatting to the gmetad daemon. nc <hostname> 8650.
This information is useful for submitting bug reports.
Result: This returns with no output.
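
The nc and telnet checks above all do the same thing: open a TCP connection, read until the daemon closes it, and look at what came back. A plain nc prints nothing both when the connection is refused and when the daemon closes it without sending anything, so "no output" can mean either. The sketch below separates the two cases: a refused connection raises ConnectionRefusedError, while an accepted but silent connection returns zero bytes. The function name read_xml_dump is made up for illustration, and the host and port are whatever you pass in; nothing here is specific to a particular Ganglia version.

```python
import socket
import sys


def read_xml_dump(host, port, timeout=5.0):
    """Connect, read until the peer closes the connection, return the bytes."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as conn:
        while True:
            chunk = conn.recv(65536)
            if not chunk:  # empty read: the peer closed the connection
                break
            chunks.append(chunk)
    return b"".join(chunks)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python read_dump.py <hostname> <port>
    dump = read_xml_dump(sys.argv[1], int(sys.argv[2]))
    print(f"{len(dump)} bytes received")
```

Pointing it at the gmond port used earlier (8649) and then at the gmetad port being tested shows whether the empty reply is a refused connection or an empty answer.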






thanks
Peter

------------------------------------------------------------------------------
_______________________________________________
Ganglia-general mailing list
Ganglia-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-general

