Re: [Ganglia-general] Hosts Up and Down information
This is so awesome!!! Thanks a lot, Ron. I had meanwhile come up with some code that probes the XML retrieved by running nc <ip> <port> and checks for hosts with TN > TMAX*4, populating arrays of dead and alive servers accordingly. But I think now I can use this directly.

Thanks,
Neel

From: Ron Wellnitz <ron.welln...@debeka.de>
Sent: Fri, 22 Jul 2011 12:50:37
To: indran...@rediff.co.in
Cc: ganglia-general <ganglia-general@lists.sourceforge.net>
Subject: Re: [Ganglia-general] Hosts Up and Down information

Hi Neel,

please first try gstat on a node which is running gmetad. Example:

  gstat -a -d -1 -p <xml_port - default: 8649>
  gstat -a -d -1 -p 8647

The output looks like this:

CLUSTER INFORMATION
       Name:
      Hosts: 108
Gexec Hosts: 0
 Dead Hosts: 2
  Localtime: Fri Jul 22 09:15:20 2011

DEAD CLUSTER HOSTS
                       Hostname   Last Reported
                                  Thu Jul 14 08:22:52 2011
                                  Fri Apr 29 13:33:23 2011

Regards,
Ron

On 22.07.2011 07:23, Indranil C wrote:

Could anyone please point me towards the piece of code, or the logic, which Ganglia uses to display the Hosts Up and Hosts Down stats? I want to set up an alert system based on a 20% hosts-down threshold. I am not very well versed with either PHP or HTML templates, and I am finding it a little difficult to figure out how Ganglia is getting the values.

Thanks,
Neel
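The probe Neel describes (scrape gmond's XML dump and flag hosts whose TN exceeds TMAX*4) can be sketched roughly as below. This is a minimal sketch, not code from the thread: the function names `classify_hosts` and `fetch_gmond_xml` are made up for illustration, and the default port 8649 is gmond's usual XML port.

```python
import socket
import xml.etree.ElementTree as ET

def classify_hosts(xml_text, stale_factor=4):
    """Split hosts into (alive, dead) lists using the TN > TMAX * stale_factor
    heuristic described above. TN is seconds since the host last reported;
    TMAX is how often it is expected to report."""
    alive, dead = [], []
    for host in ET.fromstring(xml_text).iter("HOST"):
        tn = int(host.get("TN", 0))
        tmax = int(host.get("TMAX", 20))
        (dead if tn > tmax * stale_factor else alive).append(host.get("NAME"))
    return alive, dead

def fetch_gmond_xml(host="localhost", port=8649):
    """Read the XML dump that gmond emits on its tcp_accept_channel
    (the same data `nc <ip> <port>` prints)."""
    chunks = []
    with socket.create_connection((host, port), timeout=10) as s:
        while True:
            data = s.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")
```

For the 20% alerting threshold, one could then compare `len(dead) / (len(alive) + len(dead))` against 0.2. Note the exact staleness rule the web frontend applies may differ slightly; TN > TMAX*4 is the heuristic quoted in this thread.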
---
Ganglia-general mailing list
Ganglia-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-general
Re: [Ganglia-general] missing many samples with host-sflow...
500 nodes sending sFlow-HOST data is probably only about 25 packets/sec, so the issue here is unlikely to be a performance bottleneck in terms of CPU, network bandwidth, UDP buffers, etc. Right now the most likely explanation seems to be some race condition over how long gmond waits before it considers the data to be stale. In sflow.c, process_sflow_gmetric(), we have this:

  gfull->metric.tmax = 60; /* (secs) poll if it changes faster than this */
  gfull->metric.dmax = 0;  /* (secs) how long before stale? */

I was under the impression that setting dmax to 0 is supposed to mean that the data does not expire at all, but maybe this assumption is wrong? Please confirm that you are running hsflowd with a polling interval set to 30 seconds or less, and please confirm that the CPU is not busy.

The other step we could take is to log the values of lostDatagrams and lostSamples when the debug level is set on the command line (these counters are maintained within sflow.c but not logged at the moment). That would help to confirm or rule out a bottleneck in the front end. The gmond process blocks while the XML data is being extracted, so if you were extracting the XML data over a slow link to a slow device and it took a number of seconds to transfer, then you might conceivably lose packets due to the UDP input buffer overflowing during that time. If that is happening, it will show up in the lostDatagrams counter. The workaround might just be to enlarge the input socket buffer via setsockopt(). I've seen this bumped up from about 130K to over 2MB before, so that would buy more time without having to do anything more elaborate.

Regards,
Neil

On Jul 21, 2011, at 12:32 PM, Robert Jordan wrote:

I have a cluster with approximately 500 nodes reporting via host-sflow to a single gmond. In the past few days my graphs have started to look like dotted lines, and most of the time Ganglia reports all of the nodes as down. Has anyone seen similar issues?
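The socket-buffer workaround Neil suggests can be illustrated with a short sketch. This is not gmond's actual code (gmond is C, where the equivalent call is setsockopt(fd, SOL_SOCKET, SO_RCVBUF, ...)); it just shows the mechanism of requesting a larger kernel receive buffer and reading back what was granted.

```python
import socket

# Create a UDP socket like the one gmond uses to receive sFlow datagrams.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

# Ask the kernel for a ~2 MB receive buffer so datagrams can queue while
# the process is busy (e.g. streaming the XML dump to a slow client).
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 2 * 1024 * 1024)

# The kernel may cap the request (on Linux, via net.core.rmem_max) or
# double it for bookkeeping, so read back the size actually granted.
granted = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
print("UDP receive buffer: %d bytes" % granted)
sock.close()
```

On Linux, getting the full 2 MB typically also requires raising the `net.core.rmem_max` sysctl; otherwise the request is silently capped.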
Re: [Ganglia-general] missing many samples with host-sflow...
Upon investigation we found that a handful of the nodes were sending with sFlow agent-address == 0.0.0.0. These nodes boot using DHCP, so this may be a race where the hsflowd daemon starts before the IP address has been learned. The fix will be to make hsflowd wait until it has a current IP address before sending (and check for changes periodically). At the gmond end, we should probably add a check to ignore any datagrams that have sFlow agent-address == 0.0.0.0.

Because multiple nodes were sending with the same agent address, the effect was to alias their data together so that it looked like successive readings from the same node. Most of the time the resulting sequence-number deltas were such that the data was being ignored anyway, but as clocks drift over time it's possible that some readings would get through and result in astronomically high deltas being recorded. If that happened, and these large deltas were enough to trip a sanity check somewhere further on (perhaps in gmetad), then that could explain how the gaps appeared in the chart for the whole cluster.

Neil

On Jul 22, 2011, at 1:06 PM, Neil Mckee wrote:

500 nodes sending sFlow-HOST data is probably only about 25 packets/sec, so the issue here is unlikely to be a performance bottleneck in terms of CPU, network bandwidth, UDP buffers, etc. Right now the most likely explanation seems to be some race condition over how long gmond waits before it considers the data to be stale. In sflow.c, process_sflow_gmetric(), we have this:

  gfull->metric.tmax = 60; /* (secs) poll if it changes faster than this */
  gfull->metric.dmax = 0;  /* (secs) how long before stale? */

I was under the impression that setting dmax to 0 is supposed to mean that the data does not expire at all, but maybe this assumption is wrong? Please confirm that you are running hsflowd with a polling interval set to 30 seconds or less, and please confirm that the CPU is not busy.
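The gmond-side check Neil proposes (drop datagrams whose agent address is 0.0.0.0) could be sketched as below. This is a rough illustration, not the actual patch: the function name `has_zero_agent_address` is made up, and the layout assumed is the sFlow v5 datagram header (a u32 version, then a u32 address type where 1 means IPv4, then the address itself).

```python
import struct

def has_zero_agent_address(datagram):
    """Return True if an sFlow v5 datagram carries agent-address 0.0.0.0,
    the symptom described above (hosts that started sending before DHCP
    assigned them an address). Such datagrams should be ignored."""
    if len(datagram) < 12:
        return True  # too short to carry a valid header; drop it as well
    version, addr_type = struct.unpack_from("!II", datagram, 0)
    if version != 5 or addr_type != 1:
        return False  # not a plain IPv4 sFlow v5 header; let other code decide
    return datagram[8:12] == b"\x00\x00\x00\x00"
```

A receive loop would then `continue` past any datagram for which this returns True, before the sequence-number bookkeeping runs, so the aliased streams never pollute each other's deltas.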