Hi Dan,
Interesting theory. Is there any test I can run to check it?
thanks
Peter
On Mon, Apr 21, 2014 at 6:15 PM, Daniel M. Weeks <week...@rpi.edu> wrote:
> Hi Peter and Alexander,
>
> This might be a bit late, but I have seen this happen in environments
> where network switches are not set up properly. They may begin by
> flooding multicast traffic and then prune it after a timer expires,
> which seems to match what you are describing: nodes can no longer see
> each other after a short period.
>
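One way to check the flood-then-prune theory, sketched under assumptions: send a small multicast probe every few seconds from one node, listen on another, and see whether probes stop arriving after a few minutes. The group 239.2.11.71 is gmond's default mcast_join address (your gmond.conf may differ); port 8659 and all function names below are made up for this test, with a port other than 8649 chosen to keep the probes out of gmond's way.

```python
import socket
import struct
import time

GROUP = "239.2.11.71"  # gmond's default mcast_join; adjust to your gmond.conf
PORT = 8659            # assumption: any free UDP port other than gmond's 8649


def send_probes(count, interval=5.0, group=GROUP, port=PORT, ttl=1):
    """Send `count` numbered probes to the multicast group, one per interval."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL,
                    struct.pack("b", ttl))
    try:
        for seq in range(count):
            sock.sendto(str(seq).encode(), (group, port))
            time.sleep(interval)
    finally:
        sock.close()


def listen(duration, group=GROUP, port=PORT):
    """Join the group for `duration` seconds; return probe arrival times
    in seconds since the listener started."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    # Ask the kernel to join the group on the default interface.
    mreq = struct.pack("4s4s", socket.inet_aton(group),
                       socket.inet_aton("0.0.0.0"))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    sock.settimeout(1.0)
    start = time.monotonic()
    arrivals = []
    try:
        while time.monotonic() - start < duration:
            try:
                sock.recvfrom(1024)
            except socket.timeout:
                continue
            arrivals.append(time.monotonic() - start)
    finally:
        sock.close()
    return arrivals


def silent_gaps(arrivals, max_gap, end=None):
    """Return (from, to) pairs where no probe arrived for more than max_gap
    seconds. Pass `end` (the listening duration) to also catch a gap that
    runs to the end of the capture."""
    points = list(arrivals) + ([end] if end is not None else [])
    return [(a, b) for a, b in zip(points, points[1:]) if b - a > max_gap]
```

For example, run listen(600) on h101 and send_probes(120) on h102 at the same time, then pass the result to silent_gaps(arrivals, 20, end=600). If probes arrive for the first few minutes and then stop while the sender is still running, that would be consistent with the switch pruning the group; it would not prove it, since a host firewall could produce the same picture.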
> - Dan
>
> On 04/01/2014 12:32 PM, Peter Cogan wrote:
> > Hi,
> >
> > I am continuing to dig into the problem.
> >
> > It seems that if I restart all the gmond daemons then I start collecting
> > data again, but only for a few minutes, and then they all stop at pretty
> > much the same time (the gmond daemons are still running, but it seems
> > that data is either not being sent to gmetad or not being collected by it).
> > What would cause this to happen after a few minutes of running fine?
> >
> > thanks
> > Peter
> >
> > On Tue, Apr 1, 2014 at 1:48 PM, Alexander Karner <a...@de.ibm.com> wrote:
> >
> > Hi!
> >
> > I see a similar situation in my client's environment, where several
> > gmond instances sometimes fail to deliver data.
> > However, after restarting the gmonds everything works fine again.
> >
> > From my observations, that could be related to a Qualys Security
> > Scanner that hammers the systems with UDP packets.
> >
> >
> >
> > Kind regards
> >
> > *Alexander Karner*
> >
> >
> >
> >
> >
> > From: Peter Cogan <peter.co...@gmail.com>
> > To: ganglia-general@lists.sourceforge.net
> > Date: 01.04.2014 13:45
> > Subject: [Ganglia-general] Hosts appear to be down
> >
> ------------------------------------------------------------------------
> >
> >
> >
> > Hi all,
> >
> > I have recently installed ganglia on a small cluster with 4 servers
> > (h101, h102, h103, h104) and am having an issue whereby the 3 slaves
> > are reported as being down (even though they are up). In fact, I can
> > make it work for a short while (see below on changing the directory
> > owner) and then they appear as dead.
> >
> > gmond is running on all four machines and gmetad is running on the
> > server (h101). The web interface is also working.
> >
> > From what I can see, the slaves appear down from the master's view
> > because TN is high:
> >
> > [root@h101 ~]# telnet h101 8649 | grep HOST | grep TN
> > <HOST NAME="h102" IP="" REPORTED="1396176378" TN="174355" TMAX="20"
> > DMAX="0" LOCATION="unspecified" GMOND_STARTED="1396175888">
> > <HOST NAME="h103" IP="" REPORTED="1396176382" TN="174351" TMAX="20"
> > DMAX="0" LOCATION="unspecified" GMOND_STARTED="1396179776">
> > <HOST NAME="h104" IP="" REPORTED="1396176379" TN="174355" TMAX="20"
> > DMAX="0" LOCATION="unspecified" GMOND_STARTED="1396176191">
> > <HOST NAME="h101" IP="" REPORTED="1396350726" TN="8" TMAX="20"
> > DMAX="0" LOCATION="unspecified" GMOND_STARTED="1396176013">
> >
> > However, if I run the same command against any of the slaves, they
> > see their own TN as low and the others as high, e.g.:
> > [root@h101 ~]# telnet h102 8649 | grep HOST | grep TN
> > <HOST NAME="h102" IP="hidden" REPORTED="1396350629" TN="2" TMAX="20"
> > DMAX="0" LOCATION="unspecified" GMOND_STARTED="1396284414">
> > <HOST NAME="h103" IP="hidden" REPORTED="1396284601" TN="66030"
> > TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1396181187">
> > <HOST NAME="h104" IP="hidden" REPORTED="1396284597" TN="66034"
> > TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1396177590">
> > <HOST NAME="h101" IP="hidden" REPORTED="1396284599" TN="66032"
> > TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1396176013">
> >
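The readings above can be checked mechanically rather than by eye. A minimal sketch, assuming the rule of thumb that a host is shown as down once TN exceeds 4 x TMAX (the factor 4 is an assumption here and is passed as a parameter, so check it against your ganglia-web version). The function names are made up, and SAMPLE holds two of the HOST lines from the h101 output above.

```python
import re

HOST_TAG = re.compile(r"<HOST\b[^>]*>")   # one <HOST ...> tag, even if wrapped
ATTRIBUTE = re.compile(r'(\w+)="([^"]*)"')  # NAME="value" pairs inside a tag


def parse_hosts(xml_text):
    """Map each host name to a dict of its HOST attributes (all strings)."""
    hosts = {}
    for tag in HOST_TAG.findall(xml_text):
        attrs = dict(ATTRIBUTE.findall(tag))
        hosts[attrs["NAME"]] = attrs
    return hosts


def stale_hosts(xml_text, factor=4):
    """Return {name: TN} for hosts whose TN exceeds factor * TMAX.

    The default factor of 4 is an assumed threshold, not taken from
    this thread.
    """
    stale = {}
    for name, attrs in parse_hosts(xml_text).items():
        tn, tmax = int(attrs["TN"]), int(attrs["TMAX"])
        if tn > factor * tmax:
            stale[name] = tn
    return stale


SAMPLE = """
<HOST NAME="h102" IP="" REPORTED="1396176378" TN="174355" TMAX="20"
DMAX="0" LOCATION="unspecified" GMOND_STARTED="1396175888">
<HOST NAME="h101" IP="" REPORTED="1396350726" TN="8" TMAX="20"
DMAX="0" LOCATION="unspecified" GMOND_STARTED="1396176013">
"""

print(stale_hosts(SAMPLE))
```

Feeding it the saved output of "telnet <host> 8649" from each node in turn shows which peers each gmond has stopped hearing from. In the outputs above, every node is fresh only to itself, which points at node-to-node traffic rather than at gmetad.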
> > I have tried restarting gmond on all machines and gmetad on the
> > server but it doesn't help.
> > I went through the FAQs - here are the results:
> >
> > * For gmond:
> >     o See if the gmond service is running: issue the 'ps aux | grep
> >       gmond' command. Confirmed.
> >
> >     o Stop the gmond service and run it by hand in debug mode:
> >       '/etc/init.d/gmond stop; /usr/sbin/gmond -d 2'. Look for
> >       errors near the top. No errors.
> >
> >     o Attempt to retrieve the XML data by netcatting to the gmond
> >       daemon: 'nc <hostname> 8649'. Works for all hosts.
> >
> >     o Confirm that UDP connections can be established between the
> >       gmetad and gmond (or gmond and other gmonds, for multicast
> >       purposes) by running 'nc -u -l 8653' on the host in
> >       question, then 'echo "hello" | nc -u <hostname> 8653' from the
> >       gmetad or another gmond. This works, but only for the first
> >       echo. If I try to send another message I get 'connection
> >       refused'. I have to stop and restart nc -u -l before it will
> >       receive another message. Not sure if this is expected
> >       behaviour.
> >
> >     o Check gmond data using '/usr/bin/gstat -a'. Each machine only
> >       sees itself.
> >
> >
> > * For gmetad:
> >     o See if the gmetad service is running: issue the 'ps aux | grep
> >       gmetad' command. Confirmed.
> >
> >     o Check syslog for errors: 'tail /var/log/messages'. No errors.
> >
> >     o Stop the gmetad service and run it by hand in debug mode:
> >       '/etc/init.d/gmetad stop; /usr/sbin/gmetad -d 2'. Look for
> >       errors near the top. It starts with no errors, but I don't
> >       see data from the other hosts coming in.
> >
> >     o Ensure that /var/lib/ganglia and its children are owned and
> >       writable by the nobody user (ganglia user on Debian/Ubuntu).
> >       I'm on RHEL and the owner was set to ganglia. I changed it
> >       to nobody and restarted all daemons, but then the web
> >       interface showed "There was an error collecting ganglia data
> >       (127.0.0.1:8652): fsockopen error: Connection refused". I
> >       changed the owner back to ganglia and restarted, and suddenly
> >       the web page had data from all clusters, but only for a
> >       short while. I monitored using telnet as above: the TNs were
> >       reset to low numbers for a short while, then started
> >       increasing again and the hosts appeared dead again.
> >
> >     o Retrieve the XML data by netcatting to the gmetad daemon:
> >       'nc <hostname> 8650'. This information is useful for
> >       submitting bug reports. This returns with no output.
> >
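On the 'nc -u -l' step in the gmond checklist above: as far as I know, some netcat builds lock onto the first sender once a datagram arrives, so a second echo from a new source port can be refused even when UDP is fine. If so, the 'connection refused' would be a netcat quirk rather than a network fault. A minimal sketch of a listener that keeps accepting datagrams from any sender; the function names are made up, and port 8653 is simply the one used in the checklist.

```python
import socket


def open_udp_listener(host="0.0.0.0", port=8653):
    """Bind an unconnected UDP socket, so any sender can reach it."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def receive_datagrams(sock, count, timeout=5.0):
    """Collect up to `count` datagrams as (sender_address, payload) pairs.

    Stops early if nothing arrives within `timeout` seconds.
    """
    sock.settimeout(timeout)
    received = []
    for _ in range(count):
        try:
            payload, sender = sock.recvfrom(65535)
        except socket.timeout:
            break
        received.append((sender, payload))
    return received
```

Run receive_datagrams(open_udp_listener(), 5, timeout=60) on the host in question and repeat the echo from another node several times. If every message shows up here, plain unicast UDP between the nodes is working, and the remaining suspect is multicast delivery.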
> >
> >
> >
> > thanks
> > Peter
> >
> >
> ------------------------------------------------------------------------------
> > _______________________________________________
> > Ganglia-general mailing list
> > Ganglia-general@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/ganglia-general
> >
>
>
> --
> Daniel M. Weeks
> Systems Programmer
> Center for Computational Innovations
> Rensselaer Polytechnic Institute
> Troy, NY 12180
> 518-276-4458
>