What does your gmetad.conf look like?

I would perhaps simplify the config by following

http://sourceforge.net/apps/trac/ganglia/wiki/ganglia_quick_start

Use a single send channel and let's see if that fixes the issue. We can expand on it from there.

Vladimir

On Wed, 26 Oct 2011, Lance Smith wrote:

I've been able to partially track down why my nodes are only reported as "up"
for ~6 minutes before going "down" on the web interface: it looks like gmond
stops receiving multicast and unicast messages. Restarting gmond on the
“gatherer node” re-establishes listening for about 6 minutes, then gmond gets
nothing from the other nodes.

A cron job restarts gmond every hour, but all that does is produce a sawtooth
uptime chart. Restarting gmetad isn't needed, as it seems to be communicating
with gmond OK; this is reflected in the web charts (I tried it anyway just in
case, to no avail). gmetad reports only the localhost's gmond as up. The rest
are "down" until gmond is restarted.

Is there anything else I can check? Any ideas what I can do to make the systems
be reported as "up"?
Recompiled with apr-1.4.5 (originally 1.2.7), no effect (another posting had
this as a problem).
OS = CentOS 5.5 and 5.6
ganglia 3.2.0
confuse 2.7
pcre 8.13
rrdtool 1.4.4
No IPv6
network: all on the same switch
 
Per this posting by Avani Sharma there is a problem with older versions of apr:
http://sourceforge.net/mailarchive/message.php?msg_id=27794074
[root@lando ganglia]# ldd /usr/local/sbin/gmond | grep apr
libapr-1.so.0 => /usr/local/apr/lib/libapr-1.so.0 (0x00002b7fa3feb000)
[root@lando ganglia]# /usr/local/apr/bin/apr-1-config --version
1.4.5
 
 
 
/etc/gmond.conf (from system lando)
http://pastebin.com/d1XcBs18
I only changed the cluster name and port, and added the other hosts as unicast
targets. The other nodes have this node as their unicast target.

****** Here is tcpdump when the systems are marked as down, from the “gathering gmond node”: **********
[root@lando ziggy]# /usr/sbin/tcpdump -i any ip multicast
tcpdump: WARNING: Promiscuous mode not supported on the "any" device
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 96 bytes
09:54:13.273830 IP lando.33023 > 239.2.11.71.8641: UDP, length 48
09:54:13.273851 IP lando.33023 > 239.2.11.71.8641: UDP, length 44
09:54:13.273870 IP lando.33023 > 239.2.11.71.8641: UDP, length 48
09:54:13.273890 IP lando.33023 > 239.2.11.71.8641: UDP, length 44
09:54:13.273909 IP lando.33023 > 239.2.11.71.8641: UDP, length 44
09:54:13.273920 IP lando.33023 > 239.2.11.71.8641: UDP, length 48
09:54:13.273935 IP lando.33023 > 239.2.11.71.8641: UDP, length 44
09:54:33.274233 IP lando.33023 > 239.2.11.71.8641: UDP, length 48
09:54:33.274255 IP lando.33023 > 239.2.11.71.8641: UDP, length 44
09:54:33.274277 IP lando.33023 > 239.2.11.71.8641: UDP, length 48
09:54:33.274295 IP lando.33023 > 239.2.11.71.8641: UDP, length 48
09:54:33.274314 IP lando.33023 > 239.2.11.71.8641: UDP, length 48
09:54:33.274334 IP lando.33023 > 239.2.11.71.8641: UDP, length 48
09:54:33.274365 IP lando.33023 > 239.2.11.71.8641: UDP, length 48
09:54:34.274377 IP lando.33023 > 239.2.11.71.8641: UDP, length 48
09:54:35.274753 IP lando.33023 > 239.2.11.71.8641: UDP, length 48
[snip]
 
********** Then I restart gmond on the gathering gmond node, which makes things all better **********
[root@lando ziggy]# /sbin/service gmond restart
Shutting down GANGLIA gmond:                               [  OK  ]
Starting GANGLIA gmond:                                    [  OK  ]
[root@lando ziggy]#

******Here is tcpdump when the systems are marked as up: **********
[root@lando ziggy]# /usr/sbin/tcpdump -i any ip multicast
tcpdump: WARNING: Promiscuous mode not supported on the "any" device
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 96 bytes
09:56:47.346753 IP yoda.46045 > 239.2.11.71.8641: UDP, length 176
09:56:47.346801 IP yoda.46045 > 239.2.11.71.8641: UDP, length 48
09:56:47.346807 IP yoda.46045 > 239.2.11.71.8641: UDP, length 204
09:56:47.346857 IP yoda.46045 > 239.2.11.71.8641: UDP, length 48
09:56:47.347078 IP lando.58895 > 239.2.11.71.8641: UDP, length 28
09:56:47.347298 IP lando.58895 > 239.2.11.71.8641: UDP, length 32
09:56:50.289075 IP han.51957 > 239.2.11.71.8641: UDP, length 52
09:56:50.289084 IP han.51957 > 239.2.11.71.8641: UDP, length 56
09:56:50.289084 IP han.51957 > 239.2.11.71.8641: UDP, length 56
09:56:50.289096 IP han.51957 > 239.2.11.71.8641: UDP, length 56
09:56:50.289099 IP han.51957 > 239.2.11.71.8641: UDP, length 56
09:56:50.289200 IP yoda.46045 > 239.2.11.71.8641: UDP, length 28
[snip]
 
Approximately 6 minutes later gmond stops hearing the other nodes and only hears
itself again. Restarting gmond on the other nodes has no effect on what the
gathering gmond receives. It's always ~6 minutes, never 5 or 10.
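
One way to compare the two captures without reading them line by line is to
tally packets per sender. The following is a minimal Python sketch, not part of
the original report: the function name `count_senders` and the regular
expression are illustrative assumptions, written against the tcpdump line
format shown above ("host.port > group.port: UDP, length N"). If only the local
host appears in a capture taken while the nodes are marked "down", that is
consistent with the behaviour described.

```python
import re
from collections import Counter

# Matches tcpdump lines such as:
#   09:54:13.273830 IP lando.33023 > 239.2.11.71.8641: UDP, length 48
# The source is split into host and port at the last dot before " > ".
LINE_RE = re.compile(
    r"^\d{2}:\d{2}:\d{2}\.\d+ IP "
    r"(?P<src>\S+)\.(?P<sport>\d+) > "
    r"(?P<dst>\S+)\.(?P<dport>\d+): UDP, length (?P<length>\d+)"
)


def count_senders(lines):
    """Return a Counter mapping each source host to its packet count.

    Lines that do not look like UDP packet lines (tcpdump banners,
    "[snip]", blank lines) are ignored.
    """
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            counts[match.group("src")] += 1
    return counts
```

For example, saving each capture to a text file and calling
`count_senders(open(path))` on both would show whether yoda and han disappear
from the "down" capture while lando remains.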
 
[root@lando ganglia]# telnet localhost 8641 | grep 192.168
<HOST NAME="luke" IP="192.168.1.1" TAGS="" REPORTED="1315501243" TN="8" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1315499617">
<HOST NAME="yoda" IP="192.168.1.2" TAGS="" REPORTED="1315501250" TN="1" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1315500568">
<HOST NAME="han" IP="192.168.1.7" TAGS="" REPORTED="1315501250" TN="1" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1315499611">
<HOST NAME="lando" IP="192.168.1.8" TAGS="" REPORTED="1315501250" TN="1" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1315501003">
Connection closed by foreign host.
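
The TN and TMAX attributes in the output above can also be checked from a
script rather than by eye. The sketch below is a hedged illustration and not
part of the original message. It assumes that the gmond TCP port (8641 in this
setup) writes the complete XML document on connect and then closes, and it
assumes the rule of thumb that a host is treated as down once TN exceeds
4 x TMAX. That threshold is an assumption and should be verified against the
Ganglia version in use. The helper names `fetch_gmond_xml` and `host_status`
are hypothetical.

```python
import socket
import xml.etree.ElementTree as ET


def fetch_gmond_xml(host="localhost", port=8641, timeout=5.0):
    """Read everything gmond sends on its TCP channel until it closes."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


def host_status(xml_data, factor=4):
    """Return (name, TN, TMAX, alive) for every HOST element.

    "alive" uses the ASSUMED rule TN <= factor * TMAX; adjust the factor
    if your Ganglia version applies a different threshold.
    """
    root = ET.fromstring(xml_data)
    rows = []
    for host in root.iter("HOST"):
        tn = int(host.get("TN", "0"))
        tmax = int(host.get("TMAX", "0"))
        rows.append((host.get("NAME"), tn, tmax, tn <= factor * tmax))
    return rows
```

Calling `host_status(fetch_gmond_xml())` on the gathering node every minute or
so would show TN climbing for the remote hosts once their packets stop
arriving. In the snapshot above, every host has TN well below TMAX, so all four
would be reported as alive under this rule.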
 

_______________________________________________
Ganglia-general mailing list
Ganglia-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-general
