Re: [Ganglia-general] [Ganglia-developers] Adding Holt-Winters databases to existing rrd causes __SummaryInfo__ metric to fail to render on graphs

2012-10-25 Thread Aaron Nichols
On Wed, Oct 24, 2012 at 9:13 AM, Vladimir Vuksan vli...@veus.hr wrote: I don't have a lot of time to look into it; however, the difference between SummaryInfo RRDs and other RRDs is that SummaryInfo contains ds[num], which is the number of nodes being summarized. I wonder if that is somehow

Re: [Ganglia-general] Question about scaling

2012-10-25 Thread Potter,Mark L
Hi Mark, I assume cnode340 is the head node that all ~340 other gmond's send their data to. If so, you could reduce the amount of redundant metadata flying around by increasing send_metadata_interval to 120 seconds or higher. That is correct, cnode340 is the head node for ganglia. I have

Re: [Ganglia-general] Question about scaling

2012-10-25 Thread Potter,Mark L
Well, things blew up at ~184 hosts. The web interface shows a random number of hosts down on each refresh, although sometimes they are all up. It reports just ~1 second to download and process the XML: Downloading and parsing ganglia's XML tree took 0.9751s. So I don't think timeouts are the problem.

Re: [Ganglia-general] Question about scaling

2012-10-25 Thread Potter,Mark L
Nicholas, I have it set to collect every 60 seconds at the moment, as per the gmetad I posted yesterday. Even with that, running netstat -ua in a 1-second watch loop, once Recv-Q pops it is still responding immediately, and the Recv-Q never stays lit, so to speak, for more than two seconds.

Re: [Ganglia-general] Question about scaling

2012-10-25 Thread Vladimir Vuksan
60 seconds is likely the problem. I would leave it at the default, i.e. 15. I can explain later. Potter,Mark L mlpot...@mdanderson.org wrote: Nicholas, I have it set to collect every 60 seconds at the moment as per the gmetad I posted yesterday but even with that, running netstat -ua in a 1 second

Re: [Ganglia-general] Question about scaling

2012-10-25 Thread Potter,Mark L
Vladimir, It is still reporting random nodes as down with gmetad set to collect every 15 seconds. Unfortunately I have to be done with this for today but will be back at it first thing in the morning (CDT). I have also made sure nothing else is running on this box. At the moment it's just