Re: [Ganglia-general] Impact of gmond polling on data collection

2012-09-21 Thread Chris Burroughs
Thanks very much Nicholas.  Your reply was very helpful and we are going
to try out your settings changes and patches.


Re: [Ganglia-general] Impact of gmond polling on data collection

2012-09-19 Thread Nicholas Satterly
Hi Peter,

Thanks for the feedback.

I've added a thread mutex to the hosts hash table as you suggested and will
send a pull request in the next day or so.

Regards,
Nick


Re: [Ganglia-general] Impact of gmond polling on data collection

2012-09-19 Thread Peter Phaal
Nick,

I think you probably need two mutexes if you want to avoid blocking
the UDP thread unnecessarily.

1. A mutex on the hash table itself, which the TCP thread grabs while it
walks the table and the UDP thread grabs whenever it adds or removes an
entry.
2. A mutex controlling access to individual entries in the hash table.
The TCP thread grabs and releases this mutex for each entry as it walks
the table; the UDP thread grabs it each time it updates an entry.

The only situation in which this locking scheme would block the UDP
thread for any significant time is when a new host starts sending
metrics and a new entry needs to be added to the hash table. This is a
rare event and not much of a concern. The TCP thread should never have
to wait long to acquire either of the mutexes.
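
To make that concrete, here is a minimal pthreads sketch of the two-mutex
scheme. The names (host_table, host_entry, table_lock, entry_lock) are
hypothetical and are not gmond's actual data structures; the point is only
to show which thread takes which lock, and when.

#include <pthread.h>

typedef struct host_entry {
    char name[64];
    double load_one;
    struct host_entry *next;
} host_entry;

static host_entry *host_table;                /* stand-in for the hosts hashtable */
static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER; /* mutex 1: structure */
static pthread_mutex_t entry_lock = PTHREAD_MUTEX_INITIALIZER; /* mutex 2: entry data */

/* UDP thread: update an existing entry (the common case, brief entry lock). */
void udp_update(host_entry *e, double value)
{
    pthread_mutex_lock(&entry_lock);
    e->load_one = value;
    pthread_mutex_unlock(&entry_lock);
}

/* UDP thread: add a new host (the rare case that takes the structural lock). */
void udp_add(host_entry *e)
{
    pthread_mutex_lock(&table_lock);
    e->next = host_table;
    host_table = e;
    pthread_mutex_unlock(&table_lock);
}

/* TCP thread: walk the table to emit XML, holding the structural lock for
 * the walk and the entry lock only while reading each entry. */
void tcp_walk(void (*emit)(const host_entry *))
{
    pthread_mutex_lock(&table_lock);
    for (host_entry *e = host_table; e != NULL; e = e->next) {
        pthread_mutex_lock(&entry_lock);
        emit(e);
        pthread_mutex_unlock(&entry_lock);
    }
    pthread_mutex_unlock(&table_lock);
}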

Peter


Re: [Ganglia-general] Impact of gmond polling on data collection

2012-09-19 Thread Nicholas Satterly
Hi Peter,

I've submitted another pull request covering a mutex for the hostdata hash
table.

Thanks again for your guidance.

Regards,
Nick

On Wed, Sep 19, 2012 at 5:53 PM, Peter Phaal peter.ph...@gmail.com wrote:

 Nick,

 I think you probably need two mutexes if you want to avoid blocking
 the UDP thread unnecessarily.

 1. A mutex on the hash table itself, which the TCP thread grabs while it
 walks the table and the UDP thread grabs whenever it adds or removes an
 entry.


https://github.com/ganglia/monitor-core/pull/51


 2. A mutex controlling access to individual entries in the hash table.
 The TCP thread grabs and releases this mutex for each entry as it walks
 the table; the UDP thread grabs it each time it updates an entry.


https://github.com/ganglia/monitor-core/pull/52


Re: [Ganglia-general] Impact of gmond polling on data collection

2012-09-19 Thread Neil Mckee
In gmond.c:process_tcp_accept_channel(), could those goto statements close the
socket and return without relinquishing the mutex?
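
For illustration, the pattern I mean (hypothetical code, not the actual gmond
source): once the lock is held, every goto target that closes the socket and
returns also has to release the mutex, otherwise the next lock attempt
deadlocks.

#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t hosts_lock = PTHREAD_MUTEX_INITIALIZER;

int handle_tcp_request(int client_fd)
{
    int rc = -1;

    pthread_mutex_lock(&hosts_lock);

    if (write(client_fd, "<GANGLIA_XML/>\n", 15) < 0)
        goto cleanup;                      /* error path must not skip the unlock */

    rc = 0;

cleanup:
    pthread_mutex_unlock(&hosts_lock);     /* release before closing and returning */
    close(client_fd);
    return rc;
}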

Neil


Re: [Ganglia-general] Impact of gmond polling on data collection

2012-09-17 Thread Nicholas Satterly
Hi Chris,

I've discovered there are two contributing factors to problems like this:

1. The number of metrics being sent (possibly in short bursts) can overflow
the UDP receive buffer.
2. The time it takes to process the metrics in the UDP receive buffer can
cause the TCP connections from the gmetads to time out (the timeout is
currently hard-coded to 10 seconds).

In your case, you are probably dropping UDP packets because gmond can't
keep up. Gmond was enhanced to allow you to increase the UDP buffer size
back in April. I suggest you upgrade to the latest version and set this to a
sensible value for your environment.

udp_recv_channel {
  port = 1234
  buffer = 1024000
}

Determining a sensible value takes a bit of trial and error. Run netstat
-su and keep increasing the buffer until the count of packet receive
errors stops going up.

$ netstat -su
Udp:
7941393 packets received
23 packets to unknown port received.
0 packet receive errors
10079118 packets sent
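
Note that the kernel may silently cap whatever buffer value you put in the
config. Assuming the buffer option ends up as an SO_RCVBUF setsockopt (which
Linux limits to net.core.rmem_max), a small standalone test program, not part
of gmond, can show what the kernel actually grants:

#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

int main(void)
{
    int s = socket(AF_INET, SOCK_DGRAM, 0);
    int requested = 1024000;             /* same value as the example config */
    int granted = 0;
    socklen_t len = sizeof(granted);

    if (setsockopt(s, SOL_SOCKET, SO_RCVBUF, &requested, sizeof(requested)) != 0)
        perror("setsockopt");
    if (getsockopt(s, SOL_SOCKET, SO_RCVBUF, &granted, &len) != 0)
        perror("getsockopt");

    /* Linux reports roughly double the granted value; if it is much smaller
     * than twice the request, raise the cap, e.g.
     * sysctl -w net.core.rmem_max=1024000, then restart gmond. */
    printf("requested %d bytes, kernel granted %d bytes\n", requested, granted);

    close(s);
    return 0;
}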

The other possibility is that it takes so long for a gmetad to pull back
all the metrics you are collecting for a cluster that it prevents gmond
from processing the metric data received via UDP. Again, this can cause
the UDP receive buffer to overflow.

The problem we had at my work is related to all of the above but manifested
itself in a slightly different way. We were seeing gaps in all our graphs
because at times none of the servers in a cluster would respond to a gmetad
poll within 10 seconds. I used to think that gmond was completely hung, but
I realised that it would respond normally most of the time; every minute or
so, however, it would take about 20-25 seconds. This coincided with the UDP
receive queue growing (the Recv-Q column below), and I realised that this
was how long it took gmond to process the metric data it had received via
UDP from all the other servers in the cluster.

$ netstat -ua
Active Internet connections (servers and established)
Proto Recv-Q  Send-Q Local Address           Foreign Address
udp   1920032      0 *:8649                  *:*


The solution was to modify gmond and move the TCP request handler into a
separate thread, so that gmond can take as long as it needs to process
incoming metric data (from a UDP receive buffer that is large enough not
to overflow) without blocking the TCP requests for the XML data.
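
Schematically (this is only a sketch of the shape of the change, not the
code in the pull request), the split looks like this:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static void *tcp_xml_listener(void *arg)
{
    (void)arg;
    for (;;) {
        /* accept() gmetad connections and write out the XML dump here;
         * a slow or large dump no longer stalls UDP processing */
        sleep(1);               /* placeholder for the blocking accept() */
    }
    return NULL;
}

int main(void)
{
    pthread_t tcp_thread;
    if (pthread_create(&tcp_thread, NULL, tcp_xml_listener, NULL) != 0) {
        perror("pthread_create");
        return 1;
    }
    for (;;) {
        /* main loop: recvfrom() on the UDP channel and update the in-memory
         * metric store as fast as packets arrive */
        sleep(1);               /* placeholder for the blocking recvfrom() */
    }
    return 0;
}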

The patched gmond is running without a problem in our environment so I have
submitted a pull request[1] for it to be included in trunk.

I can't be 100% sure that this patch will fix your problem but it would be
worth a try.

Regards,
Nick

[1] https://github.com/ganglia/monitor-core/pull/50


Re: [Ganglia-general] Impact of gmond polling on data collection

2012-09-17 Thread Peter Phaal
Nicholas,

It makes sense to multi-thread gmond, but looking at your patch, I
don't see any locking associated with the hosts hashtable. Isn't there
a possible race if new hosts/metrics are added to the hashtable by the
UDP thread at the same time the hashtable is being walked by the TCP
thread?

Peter


[Ganglia-general] Impact of gmond polling on data collection

2012-09-14 Thread Chris Burroughs
We use ganglia to monitor more than 500 hosts in multiple datacenters with about
90k unique host:metric pairs per DC.  We use this data for all of the
cool graphs in the web UI and for passive alerting.

One of our checks is to measure the TN of load_one on every box (we want
to make sure gmond is working and correctly updating metrics; otherwise
we could be blind and not know it).  We consider it a failure if TN is
greater than 600 seconds.  That threshold is arbitrary, but 10 minutes
seemed plenty long.
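
For reference, a rough standalone sketch of that kind of check (not our
actual alerting code; the host/port handling and the naive string matching
are only for illustration): it connects to a gmond XML port and counts
load_one metrics whose TN exceeds the threshold.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <netdb.h>
#include <sys/types.h>
#include <sys/socket.h>

int main(int argc, char **argv)
{
    const char *host = argc > 1 ? argv[1] : "localhost";
    const char *port = argc > 2 ? argv[2] : "8649";
    long threshold = 600;                 /* seconds, as in the check above */

    struct addrinfo hints = {0}, *res;
    hints.ai_socktype = SOCK_STREAM;
    if (getaddrinfo(host, port, &hints, &res) != 0) return 2;

    int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (fd < 0 || connect(fd, res->ai_addr, res->ai_addrlen) != 0) return 2;
    freeaddrinfo(res);

    /* Slurp the whole XML dump; gmond closes the connection when done. */
    char buf[4096];
    char *xml = NULL;
    size_t size = 0;
    ssize_t n;
    while ((n = read(fd, buf, sizeof(buf))) > 0) {
        char *tmp = realloc(xml, size + n + 1);
        if (tmp == NULL) break;
        xml = tmp;
        memcpy(xml + size, buf, n);
        size += n;
        xml[size] = '\0';
    }
    close(fd);
    if (xml == NULL) return 2;

    /* Each METRIC element carries a TN attribute (seconds since the value
     * was last updated), which follows the NAME attribute. */
    int failures = 0;
    for (char *p = xml; (p = strstr(p, "NAME=\"load_one\"")) != NULL; p++) {
        char *tn = strstr(p, "TN=\"");
        if (tn != NULL && atol(tn + 4) > threshold)
            failures++;
    }
    printf("%d load_one metric(s) with TN > %ld\n", failures, threshold);
    free(xml);
    return failures ? 1 : 0;
}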

Unfortunately we are seeing this check fail far too often.  We set up
two parallel gmetad instances (monitoring identical gmonds) per DC and
have broken our problem into two classes:
 * (A) Only one of the two gmetads stops updating for an entire cluster and
must be restarted to recover.  Since the gmetads disagree, we know the
problem is in gmetad itself. [1]
 * (B) Both gmetads say an individual host has not reported (so gmond
aggregation or sending must be at fault).  This issue is usually
transient (that is, it recovers after some period of time greater than 10
minutes).

While attempting to reproduce (A) we ran several additional gmetad
instances (again polling the same gmonds) around 2012-09-07.  Failures
per day are below [2].  The act of testing seems to have significantly
increased the number of failures.

This led us to consider whether the act of polling a gmond aggregator
could impact its ability to concurrently collect metrics.  We looked at
the code but are not experienced with concurrent programming in C.
Could someone more familiar with the gmond code comment on whether this
is likely to be a worthwhile avenue of investigation?  We are also
looking for suggestions for an empirical test to rule this out.

(Of course, other comments on the root cause of the sporadic "TN goes
up, metrics stop updating" problem are also welcome!)

Thank you,
Chris Burroughs


[1] https://github.com/ganglia/monitor-core/issues/47

[2]
120827  89
120828  6
120829  3
120830  4
120831  5
120901  1
120902  6
120903  2
120904  9
120905  4
120906  70
120907  523
120908  85
120909  4
120910  6
120911  2
120912  5
120913  5
