Hi, I have ganglia watching a rather large set of machines (~6000), and am trying to build out a resilient infrastructure. I ran into an interesting problem.

I am using gmond version 3.0.2.200511011714 (as reported by --version).

Basic layout: each location (~2000 machines) has a pair of edge hosts to which its machines send their metrics (unicast). A pair of machines connects to gmond on each of the edge collectors and centralizes the data (they connect via TCP to port 8649). We also have another pair of machines that connect to each edge gmond and grab the current XML dump for integration with Nagios (the script is called parse_ganglia, for future reference).

This worked nicely for quite a while, until one of our edge hosts got too many reportees. parse_ganglia had a connection timeout of 5 seconds, so that when one of the edge hosts was down it would move on to the other edge hosts quickly rather than waiting 60s for the down host. When one of the hosts got too many reportees, it started to take ~6s to transfer all the data. At this point, one or the other of the pair of hosts running parse_ganglia started failing on the edge host that had too many reportees.

Using tcpdump, I found that though gmond was accepting the connection from both of them, it would only send data to one at a time, and it would finish sending data to the first before moving on to the second. So:

* host a connects
* host a starts getting data
* host b connects (3-way handshake complete) but no data flows
* host a finishes getting data
* host b starts getting data
* host b finishes getting data

We solved the immediate problem by increasing the timeout from 5s to 15s, but I was a little surprised that gmond behaved in this seemingly single-threaded manner. While it's easy for us to adjust the timeout in our Python parse_ganglia, it is not so easy to poke at gmetad, and I am worried about what will happen when we have variations in network quality, more hosts requesting metrics, etc.

Is it true that gmond is single threaded in its network operations? Or maybe just the listener? What other effects might this have?
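
For reference, here is a minimal sketch of the kind of fetch parse_ganglia does. This is not the actual script: the function names and the split between a connect timeout and a read timeout are illustrative assumptions. It also assumes gmond closes the connection once the XML dump has been sent. Keeping the connect timeout short still skips a down host quickly, while the longer read timeout tolerates waiting behind another client.

```python
import socket


def fetch_gmond_xml(host, port=8649, connect_timeout=5.0, read_timeout=15.0):
    """Return the XML dump from a gmond TCP port as bytes.

    Hypothetical helper: connect_timeout bounds only the TCP connect,
    read_timeout bounds each wait for more data afterwards.
    """
    sock = socket.create_connection((host, port), timeout=connect_timeout)
    try:
        # From here on, each recv() may block for up to read_timeout.
        sock.settimeout(read_timeout)
        chunks = []
        while True:
            data = sock.recv(65536)
            if not data:  # peer closed the connection: dump is complete
                break
            chunks.append(data)
        return b"".join(chunks)
    finally:
        sock.close()


def fetch_first_available(hosts, **kwargs):
    """Try each edge host in order; return (host, xml) for the first success."""
    last_error = None
    for host in hosts:
        try:
            return host, fetch_gmond_xml(host, **kwargs)
        except OSError as exc:  # covers refused connections and timeouts
            last_error = exc
    raise RuntimeError("no edge collector answered: %r" % (last_error,))
```

With this split, a client queued behind another reader fails only if no data at all arrives for read_timeout seconds, not because the total transfer took longer than the connect timeout.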

Would it make sense to change gmond so it passes off dumping the XML feed to a child thread so that multiple simultaneous connections can be handled?

Thanks for your time,
-ben

--
Ben Hartshorne
email: [EMAIL PROTECTED]
http://ben.hartshorne.net
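
P.S. To make the child-thread idea concrete, here is a rough Python sketch of the hand-off I have in mind. It is only an illustration of the pattern, not gmond code (gmond is written in C), and the names DumpHandler, ThreadedDumpServer, start_server and build_dump are made up for this example. Each accepted connection gets its own thread, so a slow reader does not hold up the next client.

```python
import socketserver
import threading


class DumpHandler(socketserver.BaseRequestHandler):
    """Runs in its own thread for each accepted connection."""

    def handle(self):
        # Send the whole dump; the connection is closed when handle() returns.
        self.request.sendall(self.server.build_dump())


class ThreadedDumpServer(socketserver.ThreadingTCPServer):
    allow_reuse_address = True
    daemon_threads = True  # do not block process exit on slow clients

    def __init__(self, address, build_dump):
        # build_dump is a caller-supplied function returning the XML as bytes.
        self.build_dump = build_dump
        super().__init__(address, DumpHandler)


def start_server(build_dump, host="127.0.0.1", port=0):
    """Start the listener in a background thread and return the server."""
    server = ThreadedDumpServer((host, port), build_dump)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

The listener thread only accepts connections; writing the dump happens in the per-connection threads, which is the behavior I was expecting when two collectors connect at the same time.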
_______________________________________________ Ganglia-general mailing list Ganglia-general@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/ganglia-general