What does your gmetad.conf look like? I would perhaps simplify the config by following the quick start guide:
http://sourceforge.net/apps/trac/ganglia/wiki/ganglia_quick_start

Use a single send channel. Let's see if that fixes the issue. We can expand on it from there.
Vladimir

On Wed, 26 Oct 2011, Lance Smith wrote:
I've been able to partially track down why my nodes are only reported as "up" for ~6 minutes before going "down" on the web interface: it looks like gmond stops receiving multicast and unicast messages. Restarting gmond on the “gatherer node” re-establishes listening for about 6 minutes, then gmond gets nothing from the other nodes. A cron job is running to restart gmond every hour, but all that does is make for a sawtooth uptime chart.

Restarting gmetad isn't needed, as it seems to be communicating with gmond OK; this is reflected in the web charts (I tried it anyway just in case, to no avail). gmetad reports that only the localhost's gmond is up. The rest are "down" until gmond is restarted.

Is there anything else I can check? Any ideas what I can do to make the systems be reported as "up"?

Recompiled with apr-1.4.5 (originally 1.2.7), no effect (another posting had this as a problem).

OS = CentOS 5.5 and 5.6
ganglia 3.2.0
confuse 2.7
pcre 8.13
rrdtool 1.4.4
No IPv6
network: all on the same switch

Per this posting by Avani Sharma there is a problem with older versions of apr:
http://sourceforge.net/mailarchive/message.php?msg_id=27794074

[root@lando ganglia]# ldd /usr/local/sbin/gmond | grep apr
libapr-1.so.0 => /usr/local/apr/lib/libapr-1.so.0 (0x00002b7fa3feb000)
[root@lando ganglia]# /usr/local/apr/bin/apr-1-config --version
1.4.5

/etc/gmond.conf (from system lando): http://pastebin.com/d1XcBs18
I only changed the cluster name and port, and added the other hosts to unicast. The other nodes unicast to this node.
******Here is tcpdump when the systems are marked as down, from the “gathering gmond node”: **********

[root@lando ziggy]# /usr/sbin/tcpdump -i any ip multicast
tcpdump: WARNING: Promiscuous mode not supported on the "any" device
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 96 bytes
09:54:13.273830 IP lando.33023 > 239.2.11.71.8641: UDP, length 48
09:54:13.273851 IP lando.33023 > 239.2.11.71.8641: UDP, length 44
09:54:13.273870 IP lando.33023 > 239.2.11.71.8641: UDP, length 48
09:54:13.273890 IP lando.33023 > 239.2.11.71.8641: UDP, length 44
09:54:13.273909 IP lando.33023 > 239.2.11.71.8641: UDP, length 44
09:54:13.273920 IP lando.33023 > 239.2.11.71.8641: UDP, length 48
09:54:13.273935 IP lando.33023 > 239.2.11.71.8641: UDP, length 44
09:54:33.274233 IP lando.33023 > 239.2.11.71.8641: UDP, length 48
09:54:33.274255 IP lando.33023 > 239.2.11.71.8641: UDP, length 44
09:54:33.274277 IP lando.33023 > 239.2.11.71.8641: UDP, length 48
09:54:33.274295 IP lando.33023 > 239.2.11.71.8641: UDP, length 48
09:54:33.274314 IP lando.33023 > 239.2.11.71.8641: UDP, length 48
09:54:33.274334 IP lando.33023 > 239.2.11.71.8641: UDP, length 48
09:54:33.274365 IP lando.33023 > 239.2.11.71.8641: UDP, length 48
09:54:34.274377 IP lando.33023 > 239.2.11.71.8641: UDP, length 48
09:54:35.274753 IP lando.33023 > 239.2.11.71.8641: UDP, length 48
[snip]

********** Then I restart gmond on the gathering gmond, which makes things all better **********

[root@lando ziggy]# /sbin/service gmond restart
Shutting down GANGLIA gmond: [ OK ]
Starting GANGLIA gmond: [ OK ]
[root@lando ziggy]#

******Here is tcpdump when the systems are marked as up: **********

[root@lando ziggy]# /usr/sbin/tcpdump -i any ip multicast
tcpdump: WARNING: Promiscuous mode not supported on the "any" device
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 96 bytes
09:56:47.346753 IP yoda.46045 > 239.2.11.71.8641: UDP, length 176
09:56:47.346801 IP yoda.46045 > 239.2.11.71.8641: UDP, length 48
09:56:47.346807 IP yoda.46045 > 239.2.11.71.8641: UDP, length 204
09:56:47.346857 IP yoda.46045 > 239.2.11.71.8641: UDP, length 48
09:56:47.347078 IP lando.58895 > 239.2.11.71.8641: UDP, length 28
09:56:47.347298 IP lando.58895 > 239.2.11.71.8641: UDP, length 32
09:56:50.289075 IP han.51957 > 239.2.11.71.8641: UDP, length 52
09:56:50.289084 IP han.51957 > 239.2.11.71.8641: UDP, length 56
09:56:50.289096 IP han.51957 > 239.2.11.71.8641: UDP, length 56
09:56:50.289099 IP han.51957 > 239.2.11.71.8641: UDP, length 56
09:56:50.289200 IP yoda.46045 > 239.2.11.71.8641: UDP, length 28
[snip]

Approximately 6 minutes later the gathering gmond stops hearing the other nodes and only receives its own packets again. Restarting gmond on the other nodes has no effect on what the gathering gmond receives. It's always ~6 minutes, never 5 or 10.

[root@lando ganglia]# telnet localhost 8641 | grep 192.168
<HOST NAME="luke" IP="192.168.1.1" TAGS="" REPORTED="1315501243" TN="8" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1315499617">
<HOST NAME="yoda" IP="192.168.1.2" TAGS="" REPORTED="1315501250" TN="1" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1315500568">
<HOST NAME="han" IP="192.168.1.7" TAGS="" REPORTED="1315501250" TN="1" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1315499611">
<HOST NAME="lando" IP="192.168.1.8" TAGS="" REPORTED="1315501250" TN="1" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1315501003">
Connection closed by foreign host.
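
One way to watch for the moment the hosts go stale, without waiting for the web interface, might be to check the TN and TMAX attributes in the gmond dump shown above. The following is only a rough sketch: the function names are made up for illustration, and the rule that a host counts as down once TN exceeds four times TMAX is an assumption about how the web front end decides, not something confirmed in this thread. The sample data is the telnet output from the message.

```python
import re

# Matches KEY="value" attribute pairs inside a <HOST ...> opening tag.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_hosts(dump):
    """Return one dict of attributes per <HOST ...> tag found in the dump."""
    hosts = []
    for tag in re.findall(r'<HOST\s[^>]*>', dump):
        hosts.append(dict(ATTR_RE.findall(tag)))
    return hosts


def stale_hosts(hosts, factor=4):
    """Names of hosts whose TN (seconds since last report) exceeds factor * TMAX.

    ASSUMPTION: factor=4 mirrors what the web front end is believed to use;
    adjust it if your installation behaves differently.
    """
    stale = []
    for host in hosts:
        tn, tmax = int(host["TN"]), int(host["TMAX"])
        if tmax > 0 and tn > factor * tmax:
            stale.append(host["NAME"])
    return stale


# The grep'd telnet output quoted in the message above.
SAMPLE = '''
<HOST NAME="luke" IP="192.168.1.1" TAGS="" REPORTED="1315501243" TN="8" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1315499617">
<HOST NAME="yoda" IP="192.168.1.2" TAGS="" REPORTED="1315501250" TN="1" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1315500568">
<HOST NAME="han" IP="192.168.1.7" TAGS="" REPORTED="1315501250" TN="1" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1315499611">
<HOST NAME="lando" IP="192.168.1.8" TAGS="" REPORTED="1315501250" TN="1" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1315501003">
'''

if __name__ == "__main__":
    hosts = parse_hosts(SAMPLE)
    for host in hosts:
        print(host["NAME"], "TN =", host["TN"], "TMAX =", host["TMAX"])
    print("stale:", stale_hosts(hosts))
```

Run against a dump taken every minute or so, the TN values for the remote hosts should start climbing at the point where the gathering gmond stops hearing them, which would pin down the ~6 minute mark more precisely than the web charts.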
_______________________________________________
Ganglia-general mailing list
Ganglia-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-general