Re: [Ganglia-general] Impact of gmond polling on data collection
Thanks very much Nicholas. Your reply was very helpful and we are going to try out your settings changes and patches.

On 09/17/2012 09:03 AM, Nicholas Satterly wrote: [...]
Re: [Ganglia-general] Impact of gmond polling on data collection
Hi Peter,

Thanks for the feedback. I've added a thread mutex to the hosts hash table as you suggested and will send a pull request in the next day or so.

Regards,
Nick

On Mon, Sep 17, 2012 at 8:25 PM, Peter Phaal peter.ph...@gmail.com wrote: [...]
Re: [Ganglia-general] Impact of gmond polling on data collection
Nick,

I think you probably need two mutexes if you want to avoid blocking the UDP thread unnecessarily.

1. a mutex on the hashtable that must be grabbed by the TCP thread when it walks the hash table, and that the UDP thread would grab any time it adds or removes an entry from the hash table.

2. a mutex used to control access to individual entries in the hashtable. The TCP thread would grab and release this mutex for each entry as it walks the hash table. The UDP thread would grab this mutex each time it updates an entry.

The only situation in which this locking scheme would block the UDP thread for any significant time is when a new host starts sending metrics and a new entry needs to be added to the hash table. This is a rare event and not much of a concern. The TCP thread should never have to wait long to acquire either of the mutexes.

Peter

On Wed, Sep 19, 2012 at 8:45 AM, Nicholas Satterly nfsatte...@gmail.com wrote: [...]
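A minimal, self-contained sketch of the two-level locking Peter describes, using a toy fixed-size array in place of gmond's real host hashtable (all names below are hypothetical, not gmond's): the table mutex is taken for the walk and for inserts, while updates to an existing host only touch that entry's own mutex.

    /* Toy model of the two-mutex scheme; compile with -pthread. */
    #include <pthread.h>
    #include <stdio.h>

    #define MAX_HOSTS 128

    struct host {
        char name[64];
        double load_one;
        pthread_mutex_t lock;           /* per-entry mutex (Peter's point 2) */
    };

    static struct host hosts[MAX_HOSTS];
    static int n_hosts = 0;
    static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER; /* point 1 */

    /* UDP path: adding a brand-new host takes the table lock (rare event). */
    static struct host *add_host(const char *name)
    {
        pthread_mutex_lock(&table_lock);
        struct host *h = &hosts[n_hosts++];
        snprintf(h->name, sizeof h->name, "%s", name);
        pthread_mutex_init(&h->lock, NULL);
        pthread_mutex_unlock(&table_lock);
        return h;
    }

    /* UDP path: updating an existing host only takes that entry's lock. */
    static void update_host(struct host *h, double load)
    {
        pthread_mutex_lock(&h->lock);
        h->load_one = load;
        pthread_mutex_unlock(&h->lock);
    }

    /* TCP path: hold the table lock for the walk, and grab/release each
     * entry lock just long enough to serialize that entry. */
    static void dump_xml(void)
    {
        pthread_mutex_lock(&table_lock);
        for (int i = 0; i < n_hosts; i++) {
            pthread_mutex_lock(&hosts[i].lock);
            printf("<HOST NAME=\"%s\"><METRIC NAME=\"load_one\" VAL=\"%.2f\"/></HOST>\n",
                   hosts[i].name, hosts[i].load_one);
            pthread_mutex_unlock(&hosts[i].lock);
        }
        pthread_mutex_unlock(&table_lock);
    }

    int main(void)
    {
        struct host *h = add_host("web01");
        update_host(h, 0.42);
        dump_xml();
        return 0;
    }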
Re: [Ganglia-general] Impact of gmond polling on data collection
Hi Peter,

I've submitted another pull request covering a mutex for the hostdata hash table. Thanks again for your guidance.

Regards,
Nick

On Wed, Sep 19, 2012 at 5:53 PM, Peter Phaal peter.ph...@gmail.com wrote:

> 1. a mutex on the hashtable that must be grabbed by the TCP thread when it walks the hash table, and that the UDP thread would grab any time it adds or removes an entry from the hash table.

https://github.com/ganglia/monitor-core/pull/51

> 2. a mutex used to control access to individual entries in the hashtable. The TCP thread would grab and release this mutex for each entry as it walks the hash table. The UDP thread would grab this mutex each time it updates an entry.

https://github.com/ganglia/monitor-core/pull/52

[...]
Re: [Ganglia-general] Impact of gmond polling on data collection
In gmond.c:process_tc_accept_channel(), could those goto statements close the socket and return without relinquishing the mutex?

Neil

On Sep 19, 2012, at 8:45 AM, Nicholas Satterly wrote: [...]
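A sketch of the pattern Neil is asking about, with hypothetical names rather than the actual gmond code: if the handler takes the hosts mutex and an error path jumps to a cleanup label, that label has to release the mutex as well as close the socket, otherwise the thread processing UDP metrics blocks forever.

    /* Error paths must release the mutex on every exit; compile with -pthread. */
    #include <pthread.h>
    #include <unistd.h>

    static pthread_mutex_t hosts_mutex = PTHREAD_MUTEX_INITIALIZER;

    static int send_all(int fd) { (void)fd; return -1; /* pretend the write failed */ }

    static int handle_tcp_request(int client_fd)
    {
        int rc = 0;

        pthread_mutex_lock(&hosts_mutex);

        if (send_all(client_fd) < 0) {
            rc = -1;
            goto cleanup;                     /* error path still reaches the unlock */
        }

    cleanup:
        pthread_mutex_unlock(&hosts_mutex);   /* released on every exit path */
        close(client_fd);
        return rc;
    }

    int main(void)
    {
        /* A throwaway descriptor just to exercise the function. */
        return handle_tcp_request(dup(1)) == -1 ? 0 : 1;
    }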
Re: [Ganglia-general] Impact of gmond polling on data collection
Hi Chris,

I've discovered there are two contributing factors to problems like this:

1. the number of metrics being sent (possibly in short bursts) can overflow the UDP receive buffer.
2. the time it takes to process metrics in the UDP receive buffer causes TCP connections from the gmetads to time out (the timeout is currently hard-coded to 10 seconds).

In your case, you are probably dropping UDP packets because gmond can't keep up. Gmond was enhanced to allow you to increase the UDP buffer size back in April. I suggest you upgrade to the latest version and set this to a sensible value for your environment.

    udp_recv_channel {
      port = 1234
      buffer = 1024000
    }

Determining what is sensible is a bit of trial and error. Run netstat -su and keep increasing the value until you no longer see the number of packet receive errors going up.

    $ netstat -su
    Udp:
        7941393 packets received
        23 packets to unknown port received.
        0 packet receive errors
        10079118 packets sent

The other possibility is that it takes so long for a gmetad to pull back all the metrics you are collecting for a cluster that you are preventing the gmond from processing metric data received via UDP. Again, this can cause the UDP receive buffer to overflow.

The problem we had at my work is related to all of the above but manifested itself in a slightly different way. We were seeing gaps in all our graphs because at times none of the servers in a cluster would respond to a gmetad poll within 10 seconds. I used to think that the gmond was completely hung, but realised that it would respond normally most of the time; every minute or so, though, a response would take about 20-25 seconds. This happened to coincide with the UDP receive queue growing (Recv-Q column below), and I realised that it took this long for the gmond to process the metric data it had received via UDP from all the other servers in the cluster.

    $ netstat -ua
    Active Internet connections (servers and established)
    Proto Recv-Q  Send-Q Local Address
    udp   1920032      0 *:8649         *:*

The solution was to modify gmond and move the TCP request handler into a separate thread, so that gmond could take as long as it needed to process incoming metric data (from a UDP receive buffer that is large enough not to overflow) without blocking on the TCP requests for the XML data. The patched gmond is running without a problem in our environment, so I have submitted a pull request [1] for it to be included in trunk.

I can't be 100% sure that this patch will fix your problem but it would be worth a try.

Regards,
Nick

[1] https://github.com/ganglia/monitor-core/pull/50

On Sat, Sep 15, 2012 at 12:16 AM, Chris Burroughs chris.burrou...@gmail.com wrote: [...]
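A standalone sketch of what the udp_recv_channel buffer directive presumably translates to under the hood: requesting a larger SO_RCVBUF on the UDP socket and reporting what the kernel actually granted. The values below simply mirror the example above; this is not gmond's actual code.

    #include <stdio.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        if (fd < 0) { perror("socket"); return 1; }

        int requested = 1024000;            /* mirrors buffer = 1024000 above */
        if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &requested, sizeof requested) < 0)
            perror("setsockopt(SO_RCVBUF)");

        int granted = 0;
        socklen_t len = sizeof granted;
        if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &granted, &len) == 0)
            /* Linux reports roughly double the requested size (bookkeeping
             * overhead) and silently caps the request at net.core.rmem_max,
             * so that sysctl may need raising as well. */
            printf("requested %d bytes, kernel granted %d bytes\n", requested, granted);

        close(fd);
        return 0;
    }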
Re: [Ganglia-general] Impact of gmond polling on data collection
Nicholas,

It makes sense to multi-thread gmond, but looking at your patch, I don't see any locking associated with the hosts hashtable. Isn't there a possible race if new hosts/metrics are added to the hashtable by the UDP thread at the same time the hashtable is being walked by the TCP thread?

Peter

On Mon, Sep 17, 2012 at 6:03 AM, Nicholas Satterly nfsatte...@gmail.com wrote: [...]
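For illustration, a minimal sketch (hypothetical names, not the actual patch) of the shape being discussed: the XML-serving loop runs in its own thread while the main thread keeps draining UDP, with a single mutex guarding the shared host table so the walk and the updates cannot race.

    /* TCP serving in its own thread, one mutex on the shared table; -pthread. */
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    #define MAX_HOSTS 64

    static struct { char name[32]; double load_one; } hosts[MAX_HOSTS];
    static int n_hosts;
    static pthread_mutex_t hosts_mutex = PTHREAD_MUTEX_INITIALIZER;

    /* Stand-in for the TCP request handler: walk the table under the lock. */
    static void *tcp_listener(void *arg)
    {
        (void)arg;
        for (int req = 0; req < 3; req++) {
            pthread_mutex_lock(&hosts_mutex);
            for (int i = 0; i < n_hosts; i++)
                printf("<HOST NAME=\"%s\"/> load_one=%.2f\n",
                       hosts[i].name, hosts[i].load_one);
            pthread_mutex_unlock(&hosts_mutex);
            sleep(1);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t tcp;
        if (pthread_create(&tcp, NULL, tcp_listener, NULL) != 0)
            return 1;

        /* Stand-in for the UDP receive loop: add/update entries under the lock. */
        for (int i = 0; i < 6; i++) {
            pthread_mutex_lock(&hosts_mutex);
            snprintf(hosts[i % MAX_HOSTS].name, sizeof hosts[0].name, "host%02d", i);
            hosts[i % MAX_HOSTS].load_one = 0.1 * i;
            if (i >= n_hosts) n_hosts = i + 1;
            pthread_mutex_unlock(&hosts_mutex);
            usleep(400000);
        }
        pthread_join(tcp, NULL);
        return 0;
    }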
[Ganglia-general] Impact of gmond polling on data collection
We use ganglia to monitor 500 hosts in multiple datacenters with about 90k unique host:metric pairs per DC. We use this data for all of the cool graphs in the web UI and for passive alerting.

One of our checks is to measure TN of load_one on every box (we want to make sure gmond is working and correctly updating metrics, otherwise we could be blind and not know it). We consider it a failure if TN is > 600. This is an arbitrary number, but 10 minutes seemed plenty long. Unfortunately we are seeing this check fail far too often.

We set up two parallel gmetad instances (monitoring identical gmonds) per DC and have broken our problem into two classes:

* (A) Only one of the gmetads stops updating for an entire cluster, and must be restarted to recover. Since the gmetads disagree, we know the problem is there. [1]
* (B) Both gmetads say an individual host has not reported (gmond aggregation or sending must be at fault). This issue is usually transient (that is, it recovers after some period of time greater than 10 minutes).

While attempting to reproduce (A) we ran several additional gmetad instances (again polling the same gmonds) around 2012-09-07. Failures per day are below [2]. The act of testing seems to have significantly increased the number of failures. This led us to consider whether the act of polling a gmond aggregator could impact its ability to concurrently collect metrics. We looked at the code but are not experienced with concurrent programming in C. Could someone with more familiarity with the gmond code comment on whether this is likely to be a worthwhile avenue of investigation? We are also looking for suggestions for an empirical test to rule this out. (Of course, other comments on the root sporadic "TN goes up, metrics stop updating" problem are also welcome!)

Thank you,
Chris Burroughs

[1] https://github.com/ganglia/monitor-core/issues/47

[2] Failures per day:
    120827   89
    120828    6
    120829    3
    120830    4
    120831    5
    120901    1
    120902    6
    120903    2
    120904    9
    120905    4
    120906   70
    120907  523
    120908   85
    120909    4
    120910    6
    120911    2
    120912    5
    120913    5
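For reference, a rough sketch of the kind of check described above: connect to a gmond aggregator, read the XML dump it sends on connect, and flag any load_one metric whose TN (seconds since the value was last updated) exceeds 600. The host, port, and crude string scanning are illustrative assumptions, not a description of the actual monitoring setup.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <netdb.h>
    #include <sys/socket.h>

    int main(int argc, char **argv)
    {
        const char *host = argc > 1 ? argv[1] : "localhost";
        const char *port = argc > 2 ? argv[2] : "8649";   /* gmond default */
        long threshold = 600;

        struct addrinfo hints, *res;
        memset(&hints, 0, sizeof hints);
        hints.ai_socktype = SOCK_STREAM;
        int rc = getaddrinfo(host, port, &hints, &res);
        if (rc != 0) { fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(rc)); return 2; }

        int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
        if (fd < 0 || connect(fd, res->ai_addr, res->ai_addrlen) < 0) { perror("connect"); return 2; }
        freeaddrinfo(res);

        /* gmond writes the full XML dump as soon as the TCP connection opens. */
        static char xml[8 * 1024 * 1024];
        size_t used = 0;
        ssize_t n;
        while ((n = read(fd, xml + used, sizeof xml - 1 - used)) > 0)
            used += (size_t)n;
        xml[used] = '\0';
        close(fd);

        /* Crude scan: for each load_one METRIC element, read its TN attribute. */
        int failures = 0;
        for (char *p = xml; (p = strstr(p, "METRIC NAME=\"load_one\"")) != NULL; p++) {
            char *tn = strstr(p, "TN=\"");
            if (!tn) break;
            long age = strtol(tn + 4, NULL, 10);
            if (age > threshold) failures++;
        }
        printf("load_one metrics older than %lds: %d\n", threshold, failures);
        return failures ? 1 : 0;
    }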