Re: [Ganglia-general] Hosts Up and Down information

2011-07-22 Thread Indranil C
This is so awesome! Thanks a lot, Ron. In the meantime I came up with some code 
that probes the XML retrieved by running nc <ip> <port>, checks for hosts with 
TN > TMAX*4, and populates arrays of dead and alive servers accordingly. But I 
think now I can use this directly.
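
For anyone curious, the approach looks roughly like this (a simplified sketch, 
not the exact script; the host is a placeholder and gmond's default XML port 
8649 is assumed):

  #!/usr/bin/env python
  # Sketch: classify hosts as alive/dead from gmond's XML dump,
  # i.e. the same data "nc <ip> <port>" returns, with a TN > TMAX*4 check.
  import socket
  import xml.etree.ElementTree as ET

  def probe(host="127.0.0.1", port=8649):
      sock = socket.create_connection((host, port))
      chunks = []
      while True:
          data = sock.recv(65536)
          if not data:            # gmond closes the socket after the dump
              break
          chunks.append(data)
      sock.close()
      root = ET.fromstring(b"".join(chunks))
      alive, dead = [], []
      for h in root.iter("HOST"):
          tn = int(h.get("TN", "0"))       # seconds since last report
          tmax = int(h.get("TMAX", "20"))  # expected reporting interval
          (dead if tn > tmax * 4 else alive).append(h.get("NAME"))
      return alive, dead

  if __name__ == "__main__":
      alive, dead = probe()
      print("alive: %d, dead: %d" % (len(alive), len(dead)))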

Thanks,

Neel





From: Ron Wellnitz <ron.welln...@debeka.de>
Sent: Fri, 22 Jul 2011 12:50:37 
To: indran...@rediff.co.in
Cc: ganglia-general <ganglia-general@lists.sourceforge.net>
Subject: Re: [Ganglia-general] Hosts Up and Down information



Hi Neel,



please first try gstat on a node which is running gmetad.



Example:

  gstat -a -d -1 -p <xml_port - default: 8649>

  gstat -a -d -1 -p 8647



The output looks like this:



CLUSTER INFORMATION
       Name:
      Hosts: 108
Gexec Hosts: 0
 Dead Hosts: 2
  Localtime: Fri Jul 22 09:15:20 2011

DEAD CLUSTER HOSTS
                       Hostname   Last Reported
                                  Thu Jul 14 08:22:52 2011
                                  Fri Apr 29 13:33:23 2011
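
For the 20% hosts-down alert you mentioned, a small wrapper around this output 
could do. A rough sketch (untested, in Python; the threshold is just a 
placeholder):

  #!/usr/bin/env python
  # Sketch: alert when more than 20% of cluster hosts are dead,
  # parsed from the Hosts:/Dead Hosts: lines of gstat output.
  import re
  import subprocess
  import sys

  out = subprocess.check_output(
      ["gstat", "-a", "-d", "-1", "-p", "8649"]).decode()
  hosts = int(re.search(r"^\s*Hosts:\s*(\d+)", out, re.M).group(1))
  dead = int(re.search(r"^\s*Dead Hosts:\s*(\d+)", out, re.M).group(1))

  if hosts and float(dead) / hosts > 0.20:
      sys.stderr.write("ALERT: %d of %d hosts down\n" % (dead, hosts))
      sys.exit(1)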



Regards,

Ron



On 22.07.2011 07:23, Indranil C wrote:
Could anyone please point me towards the piece of code, or the logic,
that Ganglia uses to display the Hosts Up and Hosts Down stats? I want
to set up an alert system based on a 20% hosts-down threshold. I am not
very well versed in either PHP or HTML templates and am finding it a
little difficult to figure out how Ganglia is getting the values.
  

Thanks,

Neel






Re: [Ganglia-general] missing many samples with host-sflow...

2011-07-22 Thread Neil Mckee
500 nodes sending sFlow-HOST data is probably only about 25 packets/sec,  so 
the issue here is unlikely to be a performance bottleneck in terms of CPU, 
network bandwidth,  UDP buffers etc.

Right now the most likely explanation seems to be some race condition around 
how long gmond waits before it considers the data stale. In the function 
process_sflow_gmetric() in sflow.c we have this:

  gfull->metric.tmax = 60; /* (secs) poll if it changes faster than this */
  gfull->metric.dmax = 0;  /* (secs) how long before stale? */

I was under the impression that setting dmax to 0 is supposed to mean that 
the data does not expire at all,  but maybe this assumption is wrong?

Please confirm that you are running hsflowd with a polling-interval set to 30 
seconds or less,  and please confirm that the CPU is not busy.

The other step we could take is to log the values of lostDatagrams and 
lostSamples when the debug level is set on the command line (these counters 
are maintained within sflow.c but are not logged at the moment). That would 
help to confirm or rule out any bottleneck in the front end. The gmond process 
blocks while the XML data is being extracted, so if you were extracting the 
XML data over a slow link to a slow device and it took a number of seconds to 
transfer, then you might conceivably lose packets due to the UDP input buffer 
overflowing during that time. If that is happening it will show up in the 
lostDatagrams counter. The workaround might just be to set the input socket 
buffer to a bigger size with setsockopt(SO_RCVBUF). I've seen this bumped up 
from about 130K to over 2MB before, so that would buy more time without having 
to do anything more elaborate.
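
As an illustration of that workaround, here is a minimal sketch of enlarging a 
UDP receive buffer (in Python rather than gmond's C; port 6343 is the standard 
sFlow collector port, and the 2MB figure matches the one above):

  #!/usr/bin/env python
  # Sketch: enlarge a UDP socket's kernel receive buffer so bursts
  # survive while the reader is blocked (e.g. during an XML dump).
  import socket

  sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
  sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 2 * 1024 * 1024)
  sock.bind(("0.0.0.0", 6343))

  # The kernel may cap the request (net.core.rmem_max on Linux);
  # read the value back to see what was actually granted.
  granted = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
  print("requested 2MB, kernel granted %d bytes" % granted)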

Regards,
Neil


On Jul 21, 2011, at 12:32 PM, Robert Jordan wrote:

 I have a cluster with approximately 500 nodes reporting via host-sflow to a 
 single gmond.  In the past few days my graphs have started to look like 
 dotted lines and most of the time ganglia reports all of the nodes as down.  
 Has anyone seen similar issues? 




Re: [Ganglia-general] missing many samples with host-sflow...

2011-07-22 Thread Neil Mckee
Upon investigation we found that a handful of the nodes were sending with 
sFlow-agent-address == 0.0.0.0.   These nodes boot using DHCP so this may be a 
race where the hsflowd daemon starts before the IP address has been learned.   
The fix will be to make hsflowd wait until it has a current IP address before 
sending (and check for changes periodically).  And at the gmond end,  we should 
probably add a check to ignore any datagrams that have 
sFlow-agent-address==0.0.0.0.
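
A minimal sketch of that check (in Python rather than gmond's C; the offsets 
follow the sFlow v5 header, where the agent address comes after the version 
and address-type words):

  #!/usr/bin/env python
  # Sketch: drop sFlow datagrams whose agent address is 0.0.0.0.
  # sFlow v5 header: version (4 bytes), agent address type
  # (4 bytes, 1 = IPv4), agent address (4 bytes for IPv4).
  import socket
  import struct

  sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
  sock.bind(("0.0.0.0", 6343))  # standard sFlow collector port

  while True:
      datagram, peer = sock.recvfrom(65535)
      version, addr_type = struct.unpack_from("!II", datagram, 0)
      if addr_type == 1 and socket.inet_ntoa(datagram[8:12]) == "0.0.0.0":
          continue  # unidentified agent: several hosts may alias together
      # ... hand the datagram to normal sFlow processing here ...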

Because multiple nodes were sending with the same agent address, the effect was 
to alias their data together so that it looked like successive readings from 
the same node. Most of the time the resulting sequence number deltas were such 
that the data was being ignored anyway, but as clocks drift over time it's 
possible that some readings would get through and result in astronomically high 
deltas being recorded. If that happened, and those large deltas were enough 
to trip a sanity-check somewhere further on (perhaps in gmetad), then that 
could explain how the gaps appeared in the chart for the whole cluster.
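
To see the aliasing concretely, here is a toy illustration (made-up counter 
values, nothing to do with the real cluster):

  # Toy illustration: two hosts aliased behind one agent address.
  # The collector sees a single interleaved stream and computes
  # deltas between readings from different machines.
  host_a = [1000, 2000, 3000]    # packet counter on host A
  host_b = [9000000, 9001000]    # packet counter on host B

  stream = [host_a[0], host_b[0], host_a[1], host_b[1], host_a[2]]
  deltas = [b - a for a, b in zip(stream, stream[1:])]
  print(deltas)  # [8999000, -8998000, 8999000, -8998000] -- nonsense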

Neil




