Hi Peter,

Thanks for the feedback.

I've added a thread mutex to the hosts hash table as you suggested and will
send a pull request in the next day or so.
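
In case it helps with review, the locking is roughly of this shape (a
simplified sketch only; the function and variable names below are
illustrative, not the ones used in the patch):

  #include <pthread.h>

  static pthread_mutex_t hosts_mutex = PTHREAD_MUTEX_INITIALIZER;

  /* UDP thread: called when a metric/host packet arrives */
  static void update_host(void)
  {
      pthread_mutex_lock(&hosts_mutex);
      /* ... insert or update the entry in the hosts hashtable ... */
      pthread_mutex_unlock(&hosts_mutex);
  }

  /* TCP thread: called when gmetad requests the XML dump */
  static void dump_xml(void)
  {
      pthread_mutex_lock(&hosts_mutex);
      /* ... walk the hosts hashtable and write out the XML ... */
      pthread_mutex_unlock(&hosts_mutex);
  }

Holding the same mutex on both paths is what closes the walk-vs-insert race
you pointed out.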

Regards,
Nick

On Mon, Sep 17, 2012 at 8:25 PM, Peter Phaal <peter.ph...@gmail.com> wrote:

> Nicholas,
>
> It makes sense to multi-thread gmond, but looking at your patch, I
> don't see any locking associated with the hosts hashtable. Isn't there
> a possible race if new hosts/metrics are added to the hashtable by the
> UDP thread at the same time the hashtable is being walked by the TCP
> thread?
>
> Peter
>
> On Mon, Sep 17, 2012 at 6:03 AM, Nicholas Satterly <nfsatte...@gmail.com>
> wrote:
> > Hi Chris,
> >
> > I've discovered there are two contributing factors to problems like this.
> >
> > 1. The number of metrics being sent (possibly in short bursts) can
> > overflow the UDP receive buffer.
> > 2. The time it takes to process metrics in the UDP receive buffer can
> > cause TCP connections from the gmetads to time out (the timeout is
> > currently hard-coded to 10 seconds).
> >
> > In your case, you are probably dropping UDP packets because gmond can't
> > keep up. Gmond was enhanced back in April to allow you to increase the
> > UDP receive buffer size. I suggest you upgrade to the latest version and
> > set this to a sensible value for your environment.
> >
> > udp_recv_channel {
> >   port = 1234
> >   buffer = 1024000
> > }
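> >
> > (For context, and this is a simplified sketch rather than the actual gmond
> > code: the buffer option basically just asks the kernel for a bigger socket
> > receive buffer, along these lines:
> >
> >   #include <stdio.h>
> >   #include <sys/socket.h>
> >
> >   /* illustrative only: request a larger UDP receive buffer */
> >   static void set_recv_buffer(int sock, int bytes)
> >   {
> >       if (setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes)) < 0)
> >           perror("setsockopt(SO_RCVBUF)");
> >   }
> >
> > Note that on Linux the kernel caps this at net.core.rmem_max, so for large
> > values you may also need to raise that sysctl.)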
> >
> > Determining a sensible value is a bit of trial and error. Run "netstat -su"
> > and keep increasing the value until you no longer see the number of
> > "packet receive errors" going up.
> >
> > $ netstat -su
> > Udp:
> >     7941393 packets received
> >     23 packets to unknown port received.
> >     0 packet receive errors
> >     10079118 packets sent
> >
> > The other possibility is that it takes so long for a gmetad to pull back
> > all the metrics you are collecting for a cluster that you are preventing
> > the gmond from processing metric data received via UDP. Again, this can
> > cause the UDP receive buffer to overflow.
> >
> > The problem we had at my work is related to all of the above but
> > manifested itself in a slightly different way. We were seeing gaps in all
> > our graphs because at times none of the servers in a cluster would
> > respond to a gmetad poll within 10 seconds. I used to think that the
> > gmond was completely hung, but realised that it would respond normally
> > most of the time; every minute or so, though, it would take about 20-25
> > seconds. This happened to coincide with the UDP receive queue growing
> > (the "Recv-Q" column below), and I realised that it took this long for
> > the gmond to process the metric data it had received via UDP from all
> > the other servers in the cluster.
> >
> > $ netstat -ua
> > Active Internet connections (servers and established)
> > Proto Recv-Q Send-Q Local Address               Foreign Address
> > udp   1920032      0 *:8649                      *:*
> >
> > The solution was to modify gmond and move the TCP request handler into a
> > separate thread so that gmond could take as long as it needed to process
> > incoming metric data (from a UDP receive buffer that is large enough not
> > to overflow) without blocking on the TCP requests for the XML data.
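> >
> > In outline the change looks something like the following (a simplified
> > sketch, not the actual patch; the names are illustrative):
> >
> >   #include <pthread.h>
> >
> >   static void *tcp_listener(void *arg)
> >   {
> >       /* accept gmetad connections and serve the XML dump here,
> >          taking the hosts lock only while walking the hashtable */
> >       return NULL;
> >   }
> >
> >   static void start_tcp_thread(void)
> >   {
> >       pthread_t tid;
> >       /* TCP requests are now served independently of the main loop,
> >          which is left free to drain the UDP receive buffer */
> >       pthread_create(&tid, NULL, tcp_listener, NULL);
> >   }
> >
> > The point is simply that a slow XML dump no longer stops the UDP path
> > from draining its receive buffer.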
> >
> > The patched gmond is running without a problem in our environment, so I
> > have submitted a pull request[1] for it to be included in trunk.
> >
> > I can't be 100% sure that this patch will fix your problem, but it would
> > be worth a try.
> >
> > Regards,
> > Nick
> >
> > [1] https://github.com/ganglia/monitor-core/pull/50
> >
> >
> > On Sat, Sep 15, 2012 at 12:16 AM, Chris Burroughs
> > <chris.burrou...@gmail.com> wrote:
> >>
> >> We use ganglia to monitor > 500 hosts in multiple datacenters with about
> >> 90k unique host:metric pairs per DC.  We use this data for all of the
> >> cool graphs in the web UI and for passive alerting.
> >>
> >> One of our checks is to measure TN of load_one on every box (we want to
> >> make sure gmond is working and correctly updating metrics; otherwise we
> >> could be blind and not know it).  We consider it a failure if TN is >
> >> 600.  This is an arbitrary threshold, but 10 minutes seemed plenty long.
> >>
> >> Unfortunately we are seeing this check fail far too often.  We set up
> >> two parallel gmetad instances (monitoring identical gmonds) per DC and
> >> have broken our problem into two classes:
> >>  * (A) Only one of the gmetads stops updating for an entire cluster and
> >> must be restarted to recover.  Since the two gmetads disagree, we know
> >> the problem is there. [1]
> >>  * (B) Both gmetads say an individual host has not reported (gmond
> >> aggregation or sending must be at fault).  This issue is usually
> >> transient (that is, it recovers after some period of time greater than
> >> 10 minutes).
> >>
> >> While attempting to reproduce (A) we ran several additional gmetad
> >> instances (again polling the same gmonds) around 2012-09-07.  Failures
> >> per day are below [2].  The act of testing seems to have significantly
> >> increased the number of failures.
> >>
> >> This led us to consider whether the act of polling a gmond aggregator
> >> could impact its ability to concurrently collect metrics.  We looked at
> >> the code but are not experienced with concurrent programming in C.
> >> Could someone more familiar with the gmond code comment on whether this
> >> is likely to be a worthwhile avenue of investigation?  We are also
> >> looking for suggestions for an empirical test to rule this out.
> >>
> >> (Of course, other comments on the root "TN goes up, metrics stop
> >> updating" sporadic problem are also welcome!)
> >>
> >> Thank you,
> >> Chris Burroughs
> >>
> >>
> >> [1] https://github.com/ganglia/monitor-core/issues/47
> >>
> >> [2]
> >> 120827  89
> >> 120828  6
> >> 120829  3
> >> 120830  4
> >> 120831  5
> >> 120901  1
> >> 120902  6
> >> 120903  2
> >> 120904  9
> >> 120905  4
> >> 120906  70
> >> 120907  523
> >> 120908  85
> >> 120909  4
> >> 120910  6
> >> 120911  2
> >> 120912  5
> >> 120913  5
> >>
> >>
> >>