Hi all,

I have recently installed ganglia on a small cluster with 4 servers (h101,
h102, h103, h104) and am having an issue whereby the 3 slaves are reported
as being down even though they are up. I can make it work for a short
while (see below on changing the directory owner), but then they appear
as dead again.

gmond is running on all four machines and gmetad is running on the server
(h101). The web interface is also working.

From what I can see, the slaves appear down from the master's view because
TN is high:

[root@h101 ~]# telnet h101 8649 | grep HOST | grep TN
<HOST NAME="h102" IP="" REPORTED="1396176378" TN="174355" TMAX="20"
DMAX="0" LOCATION="unspecified" GMOND_STARTED="1396175888">
<HOST NAME="h103" IP="" REPORTED="1396176382" TN="174351" TMAX="20"
DMAX="0" LOCATION="unspecified" GMOND_STARTED="1396179776">
<HOST NAME="h104" IP="" REPORTED="1396176379" TN="174355" TMAX="20"
DMAX="0" LOCATION="unspecified" GMOND_STARTED="1396176191">
<HOST NAME="h101" IP="" REPORTED="1396350726" TN="8" TMAX="20" DMAX="0"
LOCATION="unspecified" GMOND_STARTED="1396176013">
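The same check can be scripted instead of being read off the telnet output. This is a minimal sketch, assuming gmond sends its XML dump to any TCP client that connects to port 8649 (as the telnet session above suggests) and assuming a host is treated as down once TN exceeds about four times TMAX. That factor is an assumption about the web frontend, not something confirmed here, so it is a parameter. The function names are made up for illustration.

```python
import re
import socket

HOST_TAG = re.compile(r"<HOST\s+([^>]*)>")
ATTR = re.compile(r'(\w+)="([^"]*)"')


def fetch_xml(host, port=8649, timeout=5.0):
    """Read everything the daemon sends on connect (what telnet/nc shows)."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # the peer closed the connection: dump finished
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def host_status(xml_text, factor=4):
    """Return {host: (tn, tmax, looks_down)} for every <HOST ...> tag.

    A regex is used instead of an XML parser so that a partial dump
    (for example one filtered through grep) can still be inspected.
    """
    status = {}
    for match in HOST_TAG.finditer(xml_text):
        attrs = dict(ATTR.findall(match.group(1)))
        tn, tmax = int(attrs["TN"]), int(attrs["TMAX"])
        # Assumed rule: a host looks down once TN exceeds factor * TMAX.
        status[attrs["NAME"]] = (tn, tmax, tn > factor * tmax)
    return status


if __name__ == "__main__":
    # Two of the HOST tags from the h101 output above.
    sample = '''
    <HOST NAME="h102" IP="" REPORTED="1396176378" TN="174355" TMAX="20"
    DMAX="0" LOCATION="unspecified" GMOND_STARTED="1396175888">
    <HOST NAME="h101" IP="" REPORTED="1396350726" TN="8" TMAX="20" DMAX="0"
    LOCATION="unspecified" GMOND_STARTED="1396176013">
    '''
    for name, (tn, tmax, down) in sorted(host_status(sample).items()):
        print(name, tn, tmax, "DOWN" if down else "up")
```

Against a live daemon the call would be `host_status(fetch_xml("h101"))`, repeated for each host to compare the views.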

However, if I run the same command against any of the slaves, they see
their own TN low and the others high, e.g.:

[root@h101 ~]# telnet h102 8649 | grep HOST | grep TN
<HOST NAME="h102" IP="hidden" REPORTED="1396350629" TN="2" TMAX="20"
DMAX="0" LOCATION="unspecified" GMOND_STARTED="1396284414">
<HOST NAME="h103" IP="hidden" REPORTED="1396284601" TN="66030"
TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1396181187">
<HOST NAME="h104" IP="hidden" REPORTED="1396284597" TN="66034"
TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1396177590">
<HOST NAME="h101" IP="hidden" REPORTED="1396284599" TN="66032"
TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1396176013">
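These numbers look consistent with TN simply being the number of seconds since the REPORTED timestamp. A quick arithmetic check under that assumption, using only the figures from the h102 output above: "now" on h102 is estimated as its own REPORTED plus its own TN, and the gap to every other host's REPORTED is then compared with the TN shown. The helper name is made up for illustration.

```python
# Values copied from the h102 output above: name -> (REPORTED, TN).
view_from_h102 = {
    "h102": (1396350629, 2),
    "h103": (1396284601, 66030),
    "h104": (1396284597, 66034),
    "h101": (1396284599, 66032),
}


def silence_seconds(view, local_host):
    """Estimate seconds since each host was last heard, from REPORTED alone."""
    reported, tn = view[local_host]
    now = reported + tn  # assumes the local host's own TN is accurate
    return {name: now - rep for name, (rep, _tn) in view.items()}


if __name__ == "__main__":
    estimate = silence_seconds(view_from_h102, "h102")
    for name, (_rep, tn) in view_from_h102.items():
        hours = estimate[name] / 3600
        print(f"{name}: TN={tn} estimated={estimate[name]} (~{hours:.1f} h)")
```

With these figures the estimates equal the reported TN values, which would mean h102 last heard from the other three hosts roughly 18 hours before the query.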

I have tried restarting gmond on all machines and gmetad on the server
but it doesn't help.

I went through the FAQs - here are the results:


   - For gmond:
      - See if the gmond service is running, issue the *ps aux|grep gmond*
        command. Confirmed.
      - Stop the gmond service and run it by hand with debug mode.
        */etc/init.d/gmond stop; /usr/sbin/gmond -d 2*. Look for errors
        near the top. No errors.
      - Attempt to retrieve the XML data by netcatting to the gmond daemon.
        *nc <hostname> 8649*. Works for all hosts.
      - Confirm that UDP connections can be established between the gmetad
        and gmond (or gmond and other gmonds for multicast purposes) by
        running *nc -u -l 8653* on the host in question, then
        *echo "hello"|nc -u <hostname> 8653* from the gmetad or another
        gmond. This works, but only for the first echo. If I try to send
        another message I get 'connection refused'. I have to stop and
        restart *nc -u -l* for it to receive another message. Not sure if
        this is expected behaviour.
      - Check gmond data using */usr/bin/gstat -a*. Each machine only sees
        itself.
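To take nc's behaviour out of the UDP test above, a small listener that stays unconnected and keeps accepting datagrams from any sender could be used instead. This is a minimal sketch; the port 8653 is the test port from the FAQ step, and the function names are made up for illustration. One possible explanation for the 'connection refused', which is an assumption rather than something verified on these machines, is that some netcat builds attach the listening UDP socket to the first sender, so a second echo arriving from a new source port is rejected.

```python
import socket


def open_listener(host="0.0.0.0", port=8653):
    """Bind an unconnected UDP socket; it accepts datagrams from any sender."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def receive(sock, count, timeout=10.0):
    """Collect up to `count` datagrams, stopping early if none arrives in time."""
    sock.settimeout(timeout)
    received = []
    try:
        while len(received) < count:
            data, sender = sock.recvfrom(65535)
            received.append((sender, data))
    except socket.timeout:
        pass  # return whatever arrived before the timeout
    return received
```

On the receiving host, `receive(open_listener(), count=5, timeout=60.0)` would wait for up to five messages while the `echo "hello"|nc -u <hostname> 8653` command is repeated from the other machines. If every echo shows up here, the network path is fine and the refusal comes from nc itself.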


   - For gmetad:
      - See if the gmetad service is running, issue the *ps aux|grep gmetad*
        command. Confirmed.
      - Check syslog for errors. *tail /var/log/messages*. No errors.
      - Stop the gmetad service and run it by hand with debug mode.
        */etc/init.d/gmetad stop; /usr/sbin/gmetad -d 2*. Look for errors
        near the top. It starts with no errors, but I don't see data from
        the other hosts coming in.
      - Ensure that */var/lib/ganglia* and its children are owned and
        writable by the *nobody* user (*ganglia* user on Debian/Ubuntu).
        I'm on RHEL and the owner was set to ganglia. I changed it to
        nobody and restarted all daemons, but then got "There was an error
        collecting ganglia data (127.0.0.1:8652): fsockopen error:
        Connection refused" on the web interface. I changed the owner back
        to ganglia and restarted, and suddenly the web page had data from
        all clusters, but only for a short while. I monitored using telnet
        as above: the TNs were reset to low numbers for a short while, then
        increased again and the hosts appeared dead again.
      - Retrieve the XML data by netcatting to the gmetad daemon.
        *nc <hostname> 8650*. This information is useful for submitting bug
        reports. This returns with no output.
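The ownership step in the list above can be checked mechanically rather than by eye. This is a minimal sketch, assuming the right owner is whichever user gmetad actually runs as (ganglia on this RHEL install, nobody according to the FAQ); the function names are made up for illustration, and only the owner and the owner-write bit are examined.

```python
import os
import pwd
import stat


def uid_of(user):
    """Look up the numeric uid for a user name such as "ganglia" or "nobody"."""
    return pwd.getpwnam(user).pw_uid


def ownership_problems(root, uid):
    """List paths at or below `root` that `uid` does not own or cannot write.

    Only the owner and the owner-write bit are checked; ACLs, SELinux
    contexts and group permissions are deliberately ignored in this sketch.
    """
    problems = []
    for dirpath, _dirnames, filenames in os.walk(root):
        candidates = [dirpath] + [os.path.join(dirpath, f) for f in filenames]
        for path in candidates:
            info = os.lstat(path)
            if info.st_uid != uid or not info.st_mode & stat.S_IWUSR:
                problems.append(path)
    return problems
```

For example, `ownership_problems("/var/lib/ganglia", uid_of("ganglia"))` returning an empty list would mean every file and directory there is owned and writable by ganglia. Running it once for ganglia and once for nobody shows which of the two the tree currently matches.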



thanks

Peter
------------------------------------------------------------------------------
_______________________________________________
Ganglia-general mailing list
Ganglia-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-general
