Re: [Ganglia-general] Hosts appear to be down

Peter Cogan Mon, 21 Apr 2014 13:17:35 -0700

Hi Dan,

Interesting theory - is there any test I can do to check it?


thanks
Peter

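One possible way to check the theory quoted below, as a rough sketch only: join
the multicast group on one node and log when each sender goes quiet. The group
239.2.11.71 and port 8649 are assumed to be the gmond defaults (check mcast_join
and port in gmond.conf); the 60-second silence threshold and the function names
are made up for illustration. If the bind fails because gmond already holds the
port, stopping gmond on that one node for the test is one option.

```python
import socket
import struct
import time

GROUP = "239.2.11.71"  # assumption: gmond's default mcast_join address
PORT = 8649            # assumption: gmond's default UDP port


def membership_request(group, iface="0.0.0.0"):
    """Build the 8-byte ip_mreq structure used with IP_ADD_MEMBERSHIP."""
    return struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton(iface))


def silent_senders(last_seen, now, threshold):
    """Return the senders whose last packet is older than `threshold` seconds."""
    return sorted(s for s, t in last_seen.items() if now - t > threshold)


def watch(group=GROUP, port=PORT, threshold=60.0, poll=5.0):
    """Join the group and print a line when a sender appears or goes silent."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    membership_request(group))
    sock.settimeout(poll)
    last_seen = {}    # sender IP -> time of the last packet received
    reported = set()  # senders already reported as silent
    try:
        while True:
            try:
                _, (sender, _) = sock.recvfrom(65535)
                if sender not in last_seen or sender in reported:
                    print(time.strftime("%H:%M:%S"), "receiving from", sender)
                last_seen[sender] = time.time()
                reported.discard(sender)
            except socket.timeout:
                pass  # nothing arrived within `poll` seconds
            for sender in silent_senders(last_seen, time.time(), threshold):
                if sender not in reported:
                    print(time.strftime("%H:%M:%S"), "silent:", sender)
                    reported.add(sender)
    finally:
        sock.close()
```

Calling watch() right after restarting the gmond daemons and leaving it running
should show whether the other nodes' packets stop arriving a few minutes later.
If they all go silent at about the same time while the daemons keep running,
that would be consistent with the switch pruning the multicast group.
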

On Mon, Apr 21, 2014 at 6:15 PM, Daniel M. Weeks <week...@rpi.edu> wrote:

> Hi Peter and Alexander,
>
> This might be a bit late but I have seen this happen in environments
> where network switches are not set up properly. They may begin by
> flooding multicast traffic and then prune it after a timer expires,
> which seems to match what you are describing - nodes can no longer see
> each other after a short period.
>
> - Dan
>
> On 04/01/2014 12:32 PM, Peter Cogan wrote:
> > Hi,
> >
> > I am continuing to dig into the problem.
> >
> > It seems that if I restart all the gmond daemons then I start collecting
> > data again - but only for a few minutes, and then they all stop at pretty
> > much the same time (the gmond daemons are still running, but it seems
> > like data is not being collected by gmetad or not being sent to gmetad).
> > What would cause this to happen after a few minutes of running fine?
> >
> > thanks
> > Peter
> >
> >
> >
> >
> >
> >
> >
> >
> > On Tue, Apr 1, 2014 at 1:48 PM, Alexander Karner <a...@de.ibm.com
> > <mailto:a...@de.ibm.com>> wrote:
> >
> >     Hi!
> >
> >     I see a similar situation in my client's environment, where various
> >     gmond daemons sometimes fail to deliver data.
> >     However, after restarting the gmonds everything works fine again.
> >
> >     From my observations, that could be related to a Qualys Security
> >     Scanner that hammers the systems with UDP packets.
> >
> >
> >
> >     Kind regards
> >
> >     *Alexander Karner*
> >
> >
> >
> >
> >
> >     From:        Peter Cogan <peter.co...@gmail.com
> >     <mailto:peter.co...@gmail.com>>
> >     To:        ganglia-general@lists.sourceforge.net
> >     <mailto:ganglia-general@lists.sourceforge.net>,
> >     Date:        01.04.2014 13:45
> >     Subject:        [Ganglia-general] Hosts appear to be down
> >
> ------------------------------------------------------------------------
> >
> >
> >
> >     Hi all,
> >
> >     I have recently installed ganglia on a small cluster with 4 servers
> >     (h101, h102, h103, h104) and am having an issue whereby the 3 slaves
> >     are reported as being down (even though they are up). In fact, I can
> >     make it work for a short while (see below on changing the directory
> >     owner) and then they appear as dead.
> >
> >     gmond is running on all four machines and gmetad is running on the
> >     server (h101). The web interface is also working.
> >
> >     From what I can see, the slaves appear down from master's view
> >     because TN is high:
> >
> >     [root@h101 ~]# telnet h101 8649 | grep HOST | grep TN
> >     <HOST NAME="h102" IP="" REPORTED="1396176378" TN="174355" TMAX="20"
> >     DMAX="0" LOCATION="unspecified" GMOND_STARTED="1396175888">
> >     <HOST NAME="h103" IP="" REPORTED="1396176382" TN="174351" TMAX="20"
> >     DMAX="0" LOCATION="unspecified" GMOND_STARTED="1396179776">
> >     <HOST NAME="h104" IP="" REPORTED="1396176379" TN="174355" TMAX="20"
> >     DMAX="0" LOCATION="unspecified" GMOND_STARTED="1396176191">
> >     <HOST NAME="h101" IP="" REPORTED="1396350726" TN="8" TMAX="20"
> >     DMAX="0" LOCATION="unspecified" GMOND_STARTED="1396176013">
> >
> >     However, if I perform the same command from any of the slaves, they
> >     see their own TN as low and the others as high, e.g.:
> >     [root@h101 ~]# telnet h102 8649 | grep HOST | grep TN
> >     <HOST NAME="h102" IP="hidden" REPORTED="1396350629" TN="2" TMAX="20"
> >     DMAX="0" LOCATION="unspecified" GMOND_STARTED="1396284414">
> >     <HOST NAME="h103" IP="hidden" REPORTED="1396284601" TN="66030"
> >     TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1396181187">
> >     <HOST NAME="h104" IP="hidden" REPORTED="1396284597" TN="66034"
> >     TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1396177590">
> >     <HOST NAME="h101" IP="hidden" REPORTED="1396284599" TN="66032"
> >     TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1396176013">
> >
> >     I have tried restarting gmond on all machines and gmetad on the
> >     server but it doesn't help.
> >     I went through the FAQs - here are the results:
> >
> >       * For gmond:
> >           o See if the gmond service is running, issue the /ps aux|grep
> >             gmond/ command. Confirmed
> >
> >
> >           o Stop the gmond service and run it by hand with debug
> >             mode. //etc/init.d/gmond stop; /usr/sbin/gmond -d 2/. Look
> >             for errors near the top. No errors
> >           o Attempt to retrieve the XML data by netcatting to the gmond
> >             daemon. /nc <hostname> 8649 /Works for all hosts
> >
> >
> >           o Confirm that UDP connections can be established between the
> >             gmetad and gmond (or gmond and other gmonds for multicast
> >             purposes) by running /nc -u -l 8653/ on the host in
> >             question, then /echo "hello"|nc -u <hostname> 8653/ from the
> >             gmetad or another gmond. This works - but only for the first
> >             echo. If I try to send another message I get 'connection
> >             refused'. I have to stop and restart nc -u -l for it to
> >             receive another message. Not sure if this is expected
> >             behaviour.
> >
> >
> >           o Check gmond data using /usr/bin/gstat -a. Each machine only
> >             sees itself.
> >
> >
> >       * For gmetad:
> >           o See if the gmetad service is running, issue the /ps aux|grep
> >             gmetad/ command. Confirmed
> >           o Check syslog for errors. /tail /var/log/messages /No errors
> >
> >
> >           o Stop the gmetad service and run it by hand with debug
> >             mode. //etc/init.d/gmetad stop; /usr/sbin/gmetad -d 2/. Look
> >             for errors near the top. It starts with no errors, but I
> >             don't see data from the other hosts coming in
> >
> >
> >           o Ensure that //var/lib/ganglia/ and its children are owned
> >             and writable by the /nobody/ user (/ganglia/ user on
> >             Debian/Ubuntu). I'm on RHEL and the user was set to ganglia.
> >             I changed it to nobody and restarted all daemons, but then
> >             got "There was an error collecting ganglia data
> >             (127.0.0.1:8652): fsockopen error: Connection refused" on
> >             the web interface. I changed it back to owner ganglia and
> >             restarted, and suddenly the web page had data from all
> >             clusters - but only for a short while. I monitored using
> >             telnet as above and the TNs were being reset to low numbers
> >             for a short while, before increasing again, and the hosts
> >             appeared dead again.
> >
> >
> >           o Retrieve the XML data by netcatting to the gmetad
> >             daemon. /nc <hostname> 8650/. This information is useful for
> >             submitting bug reports. This returns with no output
> >
> >
> >
> >
> >
> >     thanks
> >     Peter
> >
> >
> ------------------------------------------------------------------------------
> >     _______________________________________________
> >     Ganglia-general mailing list
> >     Ganglia-general@lists.sourceforge.net
> >     <mailto:Ganglia-general@lists.sourceforge.net>
> >     https://lists.sourceforge.net/lists/listinfo/ganglia-general
> >
> >
> >
> >
> >
>
>
> --
> Daniel M. Weeks
> Systems Programmer
> Center for Computational Innovations
> Rensselaer Polytechnic Institute
> Troy, NY 12180
> 518-276-4458
>

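For anyone following the thread, the telnet output quoted above can also be
checked programmatically. The sketch below flags hosts whose TN (seconds since
the last report) is well past TMAX. The factor of 4 and the function name
stale_hosts are assumptions for illustration, and the CLUSTER wrapper in the
sample is added only so the two HOST lines copied from the thread parse as one
document.

```python
import xml.etree.ElementTree as ET


def stale_hosts(xml_text, factor=4):
    """Return {host name: TN} for hosts whose TN exceeds factor * TMAX."""
    root = ET.fromstring(xml_text)
    stale = {}
    for host in root.iter("HOST"):
        tn = int(host.get("TN"))
        tmax = int(host.get("TMAX"))
        if tn > factor * tmax:  # assumed staleness rule, adjust as needed
            stale[host.get("NAME")] = tn
    return stale


# Two HOST elements taken from the output in the thread, closed and wrapped
# in an illustrative CLUSTER element so that the sample is well-formed XML.
SAMPLE = """<CLUSTER NAME="example">
<HOST NAME="h102" IP="" REPORTED="1396176378" TN="174355" TMAX="20"
 DMAX="0" LOCATION="unspecified" GMOND_STARTED="1396175888"></HOST>
<HOST NAME="h101" IP="" REPORTED="1396350726" TN="8" TMAX="20"
 DMAX="0" LOCATION="unspecified" GMOND_STARTED="1396176013"></HOST>
</CLUSTER>"""

print(stale_hosts(SAMPLE))
```

With the sample above, only h102 is reported, which matches the observation
that h101 sees itself as fresh and the other nodes as stale.
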