Re: [Ganglia-general] Question about scaling

Nicholas Satterly Thu, 25 Oct 2012 13:21:45 -0700

Hi Mark,

I wouldn't be so quick to dismiss timeouts as the problem. The "0.9751s" it
took to download and parse ganglia's XML tree is the time the PHP web
frontend took to query gmetad for its XML, whereas the timeouts I was
referring to occur when gmetad polls the gmonds during metric collection
every 15 seconds.


My suggestion would be to run "netstat -ua" in a loop on the head node and
look for a non-zero "Recv-Q" on UDP port 8649. As soon as you see it go
non-zero, telnet to port 8649 on the head node and make note of how long it
takes to respond. If it's any longer than 10 seconds you will see random
hosts down and broken graphs on the ganglia web.
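
The check above can be sketched in Python, if that is easier than watching
netstat by hand. This is only a rough sketch under a few assumptions: it
expects Linux net-tools style netstat columns (Proto, Recv-Q, Send-Q, Local
Address, ...), it runs "netstat -uan" rather than "-ua" so that ports print
numerically, and the names udp_recv_queues, time_xml_dump and watch are made
up for illustration; they are not part of ganglia.

```python
import socket
import subprocess
import time

GMOND_PORT = 8649        # port discussed in this thread
GMETAD_TIMEOUT = 10.0    # seconds; the threshold mentioned above


def udp_recv_queues(netstat_output, port=GMOND_PORT):
    """Return the Recv-Q value of every UDP socket bound to `port`.

    Assumes net-tools style columns:
    Proto Recv-Q Send-Q Local-Address Foreign-Address [State]
    """
    queues = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers and anything that is not a UDP socket line.
        if len(fields) < 5 or not fields[0].startswith("udp"):
            continue
        if not fields[1].isdigit():
            continue
        # The local address looks like 0.0.0.0:8649 or :::8649.
        local_port = fields[3].rsplit(":", 1)[-1]
        if local_port == str(port):
            queues.append(int(fields[1]))
    return queues


def time_xml_dump(host, port=GMOND_PORT, timeout=30.0):
    """Connect to gmond over TCP, read until it closes, return seconds taken.

    `timeout` applies to the connect and to each read, not to the total.
    """
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while sock.recv(65536):
            pass
    return time.monotonic() - start


def watch(host="localhost", interval=1.0):
    """Poll netstat; when Recv-Q is non-zero, time the XML dump."""
    while True:
        out = subprocess.run(
            ["netstat", "-uan"], capture_output=True, text=True
        ).stdout
        backlog = udp_recv_queues(out)
        if any(q > 0 for q in backlog):
            elapsed = time_xml_dump(host)
            flag = "SLOW" if elapsed > GMETAD_TIMEOUT else "ok"
            print(f"Recv-Q={backlog} XML dump took {elapsed:.2f}s [{flag}]")
        time.sleep(interval)
```

Running watch() on the head node itself (or watch("cnode340") from another
machine that can see the same netstat output) would print a line each time
the UDP receive queue backs up, flagged SLOW when the dump takes longer than
10 seconds.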

--Nick.

On Thu, Oct 25, 2012 at 8:30 PM, Potter,Mark L <[email protected]> wrote:

> Well, things blew up at ~184 hosts. The web interface shows a random number
> of hosts down on each refresh, although sometimes they are all up. It
> reports just ~1 second to download and process the XML: "Downloading and
> parsing ganglia's XML tree took 0.9751s", so I don't think timeouts are the
> problem. A telnet to 8649 produces the XML immediately. Could this be the
> point where I need to start using a RAM-based partition, or could it be
> something else? Is sflow so much better that I should consider using it?
> Would multiple gmonds, say one per rack, all listed in gmetad be a
> solution? At this point I am not sure of the next step, and I really
> appreciate the help the list has given me so far.
>
>
>
> >Hi Mark,
> >
> >I assume cnode340 is the head node that all ~340 other gmonds send their
> >data to. If so, you could reduce the amount of redundant metadata flying
> >around by increasing "send_metadata_interval" to 120 seconds or higher.
>
> That is correct, cnode340 is the head node for ganglia. I have increased
> "send_metadata_interval" to 120 seconds and have 100 nodes reporting at
> this point, and it seems pretty smooth. I am going to add the others ~50 at
> a time.
>
> >Also, I suspect that if you telnet to port 8649 on your head node it will
> >take a while to respond because it's busy processing incoming UDP metrics.
> >If it takes more than 10 seconds to respond on a regular basis then gmetad
> >will time out [1].
>
> So far, with the 100 I have, the response is an instant dump of the XML.
>
> >Try deploying a recently patched version of gmond [2] to the head node
> >which is now multi-threaded and see if that fixes the problem. It starts a
> >separate thread for responding to XML metric requests and should respond
> >immediately while the main thread is still processing metrics.
>
> I am running:
>
> gmond 3.4.0
> gmetad 3.4.0
> Ganglia Web Frontend version 3.5.2
>
> Would I need to patch gmond at this version?
>
>
> <SNIP>
>
>
> ------------------------------------------------------------------------------
> Everyone hates slow websites. So do we.
> Make your web apps faster with AppDynamics
> Download AppDynamics Lite for free today:
> http://p.sf.net/sfu/appdyn_sfd2d_oct
> _______________________________________________
> Ganglia-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/ganglia-general
>
