Howdy Gangliati,

I'm having a strange problem that seems to be with multicast, but I'm
not really sure.  I had a very similar problem in the past and posted
here about it.  That turned out to be a problem on one of the network
switches, but my network team insists that this is not the same issue.

I'm running Ganglia 2.5.4 (need to upgrade, I know) on about 16
different clusters/subnets of ~200 hosts each.  Each subnet has a
"control" host that runs gmond as well as named, ypserv, dhcpd,
etc.  I have a central monitoring host that is dedicated to running
gmetad and the web frontend, and it talks to the 16 different control nodes.
Hope that makes sense.  We've been running this way with no major
trouble for quite a while.

I recently brought a new subnet/cluster online, and now I'm having
trouble.  The control box on this subnet seems to be isolated from the
rest.  Running gstat --all on it shows only itself, not the rest of the
subnet.  The rest of the subnet sees everything except this control box.

I've rebooted all the machines as well as restarted all gmonds several
times.  When you first start up gmond on the control box, it only sees
itself...then some random amount of time later, it will list the other
nodes in the subnet as being dead.  Similarly, the other hosts report
the control box as being dead.  I can point my gmetad to a random node
in the subnet, and that works fine...I just can't get the control box to
be part of the cluster.  So it seems to me that they do communicate at
some point to at least populate the dead list.  I've done tcpdumps
looking for multicast traffic between the control box and the rest, but
nothing ever shows up.
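
A check like the tcpdump one above could also be scripted, so the same
test can be run on the control box and on an ordinary node and the
results compared.  The following is only a minimal sketch: it assumes
gmond is using Ganglia's default multicast channel 239.2.11.71 on port
8649 (adjust both to match gmond.conf), and the function names are
made up for illustration.

```python
import socket

GROUP = "239.2.11.71"  # assumption: Ganglia's default multicast channel
PORT = 8649            # assumption: Ganglia's default port


def membership_request(group, iface="0.0.0.0"):
    """Build the 8-byte ip_mreq value: group address, then local interface."""
    return socket.inet_aton(group) + socket.inet_aton(iface)


def listen(group=GROUP, port=PORT, idle_timeout=60.0):
    """Join the group and collect sender addresses.

    Returns once no datagram has arrived for idle_timeout seconds.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    # Allow sharing the port with a gmond that is already bound to it.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    membership_request(group))
    sock.settimeout(idle_timeout)
    seen = set()
    try:
        while True:
            data, (addr, _port) = sock.recvfrom(65535)
            if addr not in seen:
                seen.add(addr)
                print("first packet from", addr, "-", len(data), "bytes")
    except socket.timeout:
        pass
    finally:
        sock.close()
    return seen


if __name__ == "__main__":
    senders = listen()
    print(len(senders), "distinct senders heard")
```

If this lists the whole subnet when run on a node but only the local
address when run on the control box, that would suggest the group
traffic is not being forwarded to the control box's port on the 6500.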

The control box is on a different physical network segment...the nodes
are plugged into 48-port Cisco switches (100 Mb), and those have a GigE
connection back to a big Cisco 6500.  The control box has a direct GigE
connection to the 6500.  Same deal as with all our other subnets.

I'm no network whiz, but I've had our network team beating their heads
against this, and they insist there is nothing wrong on their end.
Anyone else have any ideas?  Thanks!


Steve Gilbert
Unix Systems Administrator
[EMAIL PROTECTED]
