500 nodes sending sFlow-HOST data is probably only about 25 packets/sec, so 
the issue here is unlikely to be a performance bottleneck in terms of CPU, 
network bandwidth, UDP buffers, etc.

Right now the most likely explanation seems to be some race condition around 
how long gmond waits before it considers the data to be "stale". In sflow.c, 
the function process_sflow_gmetric() has this:

  gfull->metric.tmax = 60; /* "(secs) poll if it changes faster than this" */
  gfull->metric.dmax = 0; /* "(secs) how long before stale?" */

I was under the impression that setting "dmax" to 0 is supposed to mean that 
the data does not expire at all,  but maybe this assumption is wrong?
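
To make that assumption concrete, here is a minimal sketch of the semantics 
I would expect from "dmax". The helper metric_expired() is hypothetical and 
is not taken from the gmond source; it only illustrates "0 means never 
expire", which is the very assumption in question.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical helper, NOT code from gmond: returns true when a metric
 * last heard at `last_heard` should be treated as expired at `now`.
 * Assumption being illustrated: dmax == 0 means "never expire". */
static bool metric_expired(time_t now, time_t last_heard, unsigned int dmax)
{
    assert(now >= last_heard);   /* clock is expected to move forward */

    if (dmax == 0)
        return false;            /* assumed meaning: no expiry at all */

    /* Otherwise the metric expires once it is older than dmax seconds. */
    return (now - last_heard) > (time_t)dmax;
}
```

If gmond instead treated 0 as an ordinary timeout value, every metric would 
look stale immediately, so checking the real behavior in the gmond source 
is the next step.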

Please confirm that you are running hsflowd with a polling-interval set to 30 
seconds or less,  and please confirm that the CPU is not busy.

The other step we could take is to log the values of "lostDatagrams" and 
"lostSamples" when the debug level is set on the command line (these counters 
are maintained within sflow.c but not logged at the moment). That would 
help to confirm or rule out a bottleneck in the front end. The gmond 
process blocks while the XML data is being extracted. So if you were 
extracting the XML data over a slow link to a slow device and it took a number 
of seconds to transfer, then you might conceivably lose packets due to the UDP 
input buffer overflowing during that time. If that is happening it will show 
up in the lostDatagrams counter. The workaround might just be to increase the 
input socket buffer size (setsockopt() with SO_RCVBUF). I've seen this 
bumped up from about 130K to over 2MB before, so that would buy more time 
without having to do anything more elaborate.
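
As a rough sketch of the buffer workaround, the receive buffer on a UDP 
socket can be enlarged with the standard setsockopt() call and the 
SO_RCVBUF option. The helper name set_udp_rcvbuf() is made up for 
illustration and is not part of gmond; where the call would go in gmond 
is not shown here.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper, not part of gmond: ask the kernel for a larger
 * receive buffer on socket `fd` and return the size actually granted,
 * or -1 on error. */
static int set_udp_rcvbuf(int fd, int requested_bytes)
{
    int granted = 0;
    socklen_t len = sizeof(granted);

    assert(requested_bytes > 0);

    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
                   &requested_bytes, sizeof(requested_bytes)) < 0)
        return -1;

    /* Read the value back: the kernel may grant less than requested. */
    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &granted, &len) < 0)
        return -1;

    return granted;
}
```

A call such as set_udp_rcvbuf(fd, 2 * 1024 * 1024) on the sFlow input 
socket would request roughly the 2MB mentioned above. On Linux the request 
is capped by the net.core.rmem_max sysctl, and the value read back is 
doubled to allow for bookkeeping overhead, so it is worth logging the 
granted size rather than assuming the request was honored.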

Regards,
Neil


On Jul 21, 2011, at 12:32 PM, Robert Jordan wrote:

> I have a cluster with approximately 500 nodes reporting via host-sflow to a 
> single gmond.  In the past few days my graphs have started to look like 
> dotted lines and most of the time ganglia reports all of the nodes as down.  
> Has anyone seen similar issues? 
> Ganglia-general mailing list
> Ganglia-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/ganglia-general


