500 nodes sending sFlow-HOST data is probably only about 25 packets/sec, so the issue here is unlikely to be a performance bottleneck in terms of CPU, network bandwidth, UDP buffers etc.
Right now the most likely explanation seems to be some race condition over how long gmond waits before it considers the data to be "stale". In sflow.c, the function process_sflow_gmetric() has this:

    gfull->metric.tmax = 60; /* "(secs) poll if it changes faster than this" */
    gfull->metric.dmax = 0;  /* "(secs) how long before stale?" */

I was under the impression that setting "dmax" to 0 is supposed to mean that the data does not expire at all, but maybe this assumption is wrong?

Please confirm that you are running hsflowd with a polling-interval set to 30 seconds or less, and please confirm that the CPU is not busy.

The other step we could take is to log the values of "lostDatagrams" and "lostSamples" when the debug level is set on the command line (these counters are maintained within sflow.c but not logged at the moment). That would help to confirm or deny if there is any bottleneck in the front end.

The gmond process blocks while the XML data is being extracted. So if you were extracting the XML data over a slow link to a slow device and it took a number of seconds to transfer, then you might conceivably lose packets due to the UDP input buffer overflowing during that time. If that is happening it will show up in the lostDatagrams counter.

The workaround might just be to ioctl() the input socket buffer to a bigger size. I've seen this bumped up from about 130K to over 2MB before, so that would buy more time without having to do anything more elaborate.

Regards,
Neil

On Jul 21, 2011, at 12:32 PM, Robert Jordan wrote:

> I have a cluster with approximately 500 nodes reporting via host-sflow to a
> single gmond. In the past few days my graphs have started to look like
> dotted lines and most of the time ganglia reports all of the nodes as down.
> Has anyone seen similar issues?
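For reference, the socket-buffer workaround suggested in the reply can be sketched roughly as below. This is not gmond's actual code, and the helper name grow_udp_rcvbuf() is made up for illustration. On POSIX systems the receive buffer is normally resized with setsockopt() and SO_RCVBUF rather than ioctl(). The kernel may cap the request (on Linux the limit is net.core.rmem_max), so the sketch reads the value back to see what was really granted.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper (not part of gmond): ask the kernel for a larger
 * receive buffer on an already-open UDP socket. Returns the buffer size
 * the kernel reports afterwards, or -1 on error. */
int grow_udp_rcvbuf(int fd, int requested_bytes)
{
    int granted = 0;
    socklen_t len = sizeof(granted);

    /* Request the new size; the kernel may silently clamp it. */
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
                   &requested_bytes, sizeof(requested_bytes)) < 0) {
        perror("setsockopt(SO_RCVBUF)");
        return -1;
    }

    /* Read back the effective size so a clamped request is visible. */
    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &granted, &len) < 0) {
        perror("getsockopt(SO_RCVBUF)");
        return -1;
    }
    return granted;
}
```

Calling grow_udp_rcvbuf(fd, 2 * 1024 * 1024) on the sFlow input socket would request roughly the 2MB mentioned above. If the returned value is much smaller than requested, the system-wide limit probably needs raising first.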
_______________________________________________
Ganglia-general mailing list
Ganglia-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-general