Re: [Ganglia-general] Sflow Apache metrics
Sergey,

It's usually best to compile mod-sflow from sources so that it matches the particular version of Apache you are running. Before you do that, you have the option of editing mod_sflow.c and changing the setting of SFWB_DEFAULT_CONFIGFILE (on line 211):
https://code.google.com/p/mod-sflow/source/browse/trunk/mod_sflow.c#211

Does that work for you?

Separate question: I'm not sure how hsflowd works if it doesn't start as root. What OS are you on?

Neil

On Mon, Apr 13, 2015 at 5:55 PM, Sergey svin...@apple.com wrote:

I found the following error in the Apache log:

[Mon Apr 13 23:25:14 2015] [error] (2)No such file or directory: apr_stat(/etc/hsflowd.auto) failed

The problem is that the hsflowd process is running in the user directory and keeps the hsflowd.auto file in its ./run directory. I can't access the /etc directory to put the file there either, because I don't have root access. Any ideas? Thanks! S.

On Apr 13, 2015, at 9:36 AM, Sergey svin...@apple.com wrote:

Yes, I installed sflowtool and it works! I get all counters except the http* ones. That's why I tested the http://hostname/sflow page, because it uses mod_sflow in Apache. It looks like some Apache+sFlow issue, but I don't know how to troubleshoot it. Thanks, S.

On Apr 10, 2015, at 6:28 PM, Leslie geekg...@gmail.com wrote:

Have you installed sflowtool and seen if the sFlow counters are even getting sent out by the machine? My next step would be a tcpdump to make sure that the sFlow counters are then getting sent to the collecting host.

On Fri, Apr 10, 2015 at 4:55 PM, Sergey svin...@apple.com wrote:

Hi All! I installed mod_sflow on Apache and am trying to collect HTTP metrics with gmond. The problem is that I don't see any HTTP metrics coming from hsflowd to gmond, nor any HTTP counters on Apache's http://hostname/sflow page. There is a list of counters, but they are all 0.
Like this:

counter method_option_count 0
counter method_get_count 0
counter method_head_count 0
counter method_post_count 0
counter method_put_count 0
counter method_delete_count 0
counter method_trace_count 0
counter method_connect_count 0
counter method_other_count 0
counter status_1XX_count 0
counter status_2XX_count 0
counter status_3XX_count 0
counter status_4XX_count 0
counter status_5XX_count 0
counter status_other_count 0
string hostname xx
gauge sampling_n 0

At the same time, http://hostname/server-status?auto is working properly:

Total Accesses: 15
Total kBytes: 5
Uptime: 149
ReqPerSec: .100671
BytesPerSec: 34.3624
BytesPerReq: 341.333
BusyWorkers: 1
IdleWorkers: 7
Scoreboard:

Is there a way to troubleshoot this? I need the sFlow metrics. Thanks! S.

--
BPM Camp - Free Virtual Workshop May 6th at 10am PDT/1PM EDT
Develop your own process in accordance with the BPMN 2 standard
Learn Process modeling best practices with Bonita BPM through live exercises
http://www.bonitasoft.com/be-part-of-it/events/bpm-camp-virtual-event?utm_source=Sourceforge_BPM_Camp_5_6_15&utm_medium=email&utm_campaign=VA_SF
___
Ganglia-general mailing list
Ganglia-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-general
Re: [Ganglia-general] FW: Ganglia and sFlow
Simon,

I don't know if this is still an issue for you, but my understanding is that the cluster name comes from the gmond instance that you send the sFlow to. So if you have 1000 hosts running hsflowd and you want to divide them into 10 clusters, then you would run 10 instances of gmond somewhere (with each listening on different udp/tcp ports if they are all on the same host). Then when gmetad gets the latest stats from each one it will do the right thing. I hope someone else will jump in if I got this wrong.

Separately, the hostname case-sensitivity thing is tricky. If we ignore the hostname that hsflowd sends and submit the stats using only the IP address, then gmond/gmetad will use a reverse-DNS lookup as the name. That might work for some users if their DNS server is consistent and reliable. Alternatively, we could automatically lowercase the hostname that we get from hsflowd. That might work in other places, but it might also make things worse, because now you have an identifier that might not match either the DNS name or the Windows case-sensitive hostname. We could try adding new config options for this that apply to the sFlow receiver in gmond, but I don't want to do that if it's just going to make things more confusing. What hostname treatment option do you think would work for you?

Neil

On Thu, Sep 18, 2014 at 1:10 PM, Simon Ambridge simon.ambri...@qubix.com wrote:

Hi

I've installed Ganglia 3.6.0 gmetad and gmond on an Oracle Linux collector and can successfully collect metrics from Oracle Linux gmond nodes. I also need to collect metrics from Windows 2012 R2 hosts, so I installed sFlow 1.23.4-x64, but I then found that I had blank graphs for the Windows node. The Windows machine has an upper-case host name, and I saw that the directory under /var/lib/ganglia/rrds was in lower case. I changed $conf['case_sensitive_hostnames'] = false; to true, and I now no longer get blank graphs for the detailed stats for the Windows node. So far so good.
However, I still have the following problems with blank stats on the main page, sFlow node cluster names, and how to use conf.php:

1. Even though I get the detailed stats for the Windows machine, the big load_one stacked graph on the main page does not display any details for it. If I link the upper-case directory in /var/lib/ganglia/rrds to a lower-case name it displays correctly. So the 'case_sensitive_hostnames' directive is respected by the node stats page but **not** by the load_one stacked graph on the main page. The main page also behaves differently in that the drop-down list of nodes shows the Windows machine in upper-case, but its detailed stats page is titled in lower-case.

2. The Oracle Linux nodes are defined in gmond.conf as belonging to their named cluster. The Windows machine is automatically lumped into that same cluster. How do I define a cluster group for an sFlow node?

3. If I create a conf.php override file as recommended and put $conf['case_sensitive_hostnames'] = false; in there, I don't get any graphs displayed at all for anything. Remove the file and the graphs come back. What am I doing wrong with conf.php?

Many thanks

Simon Ambridge
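For question 2, the one-gmond-per-cluster scheme Neil describes might look like the sketch below. All port numbers and cluster names here are invented for illustration; the block syntax follows gmond.conf, but check your own version's documentation before relying on it. Each hsflowd agent would then be configured (via its own settings or DNS-SD) to send its sFlow datagrams to the UDP port of the gmond instance for the cluster it should belong to.

```
/* Instance 1, e.g. /etc/ganglia/gmond-web.conf */
cluster { name = "web" }
udp_recv_channel { port = 8650 }
tcp_accept_channel { port = 8650 }
sflow { udp_port = 6343 }

/* Instance 2, e.g. /etc/ganglia/gmond-db.conf */
cluster { name = "db" }
udp_recv_channel { port = 8651 }
tcp_accept_channel { port = 8651 }
sflow { udp_port = 6344 }
```

gmetad would then poll each instance as its own data_source ("web" localhost:8650, "db" localhost:8651) and group the hosts accordingly.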
Re: [Ganglia-general] Gauges verses counters
FYI, Ganglia already understands the output from this alternative JMX monitoring solution: https://code.google.com/p/jmx-sflow-agent/

I think it has similar properties to embedded-jmxtrans. It is much better to have the JVM push the stats every 20 seconds or so than to have to poll for them remotely over an encrypted connection. And using the java-agent hook means you only have to change the JVM command line.

Neil

On Thu, Mar 13, 2014 at 11:12 AM, Silver, Jonathan jonathan.sil...@unify.com wrote:

We are planning on using jmxtrans to collect and propagate a number of metrics to Ganglia. There is no place in jmxtrans to define the metric as a counter or a gauge. If we do NOT predefine the metric in rrds, what will happen? What will show on the graphs? How does Ganglia know that it's a gauge and not a counter?

Thanks, Jon
Re: [Ganglia-general] NGINX / SFLOW / Ganaglia - metrics get corrupted
Mark,

It does seem like the issue is with the sFlow from nginx-sflow-module. I wrote that module, so I can probably help:

(1) Just one instance of nginx on that server, or two?
(2) What version of nginx?
(3) Single-threaded or multi-threaded nginx?
(4) Running on a Linux OS?
(5) Please upgrade to the latest nginx-sflow-module (0.9.8); the one you are running (0.9.7) has a bug that affects graceful restarts. The fix was a one-liner, so it's not a big step.
(6) Please capture and send a trace of the sFlow packets arriving from this nginx source. For example, if the IP address is 10.1.2.3 and it's coming in on eth0:

root# /usr/sbin/tcpdump -i eth0 -s 0 -w nginx_sflow.pcap udp port 6343 and ip src 10.1.2.3
(control-c after a few minutes to stop)
root# gzip nginx_sflow.pcap

then send nginx_sflow.pcap.gz.
(7) Please also send /etc/hsflowd.conf.

The kind of thing it might be: two nginx-sflow-modules running on the same host and not disambiguating properly (this is supposed to happen automatically by choosing the sFlow datasource index as the lowest-numbered TCP port that the process is listening on).

Regards, Neil

On Fri, Mar 7, 2014 at 3:40 PM, Bernard Li bern...@vanhpc.org wrote:

Can you connect to the gmond port and paste the XML for the metrics in question? I'd like to see how they're defined.

Thanks, Bernard

On Fri, Mar 7, 2014 at 11:08 AM, Flanagan, Mark mark.flana...@unify.com wrote:

http://www.sflow.org/ appears to be the defining entity for sFlow. http://www.sflow.org/sflow_http.txt would appear to define the HTTP sFlow data. It is not explicitly clear just what the counter values are supposed to mean. The general architecture of sFlow-like data would suggest the values should be a running counter (like the network interface metrics), which means gmond is interpreting the packets properly and NGINX is sending the wrong data. That's just my guess for now.
-----Original Message-----
From: Bernard Li [mailto:bern...@vanhpc.org]
Sent: Friday, March 07, 2014 1:39 PM
To: Silver, Jonathan
Cc: ganglia-general@lists.sourceforge.net; Flanagan, Mark
Subject: Re: [Ganglia-general] NGINX / SFLOW / Ganaglia - metrics get corrupted

Hi Jonathan:

Perhaps you can share how these metrics are defined?

Cheers, Bernard

On Fri, Mar 7, 2014 at 10:21 AM, Silver, Jonathan jonathan.sil...@unify.com wrote:

Does the following analysis mean anything to anyone? It seems to me that this is such a basic thing that it should have been seen by everyone else and found during first test, unless it's some config parameter.

Thanks, Jon

---

Well, I think I understand what is happening, but I don't even want to think about fixing it. I'm not sure which software is right. The sFlow data coming from NGINX reports the number of various HTTP messages (GET, HEAD, 1XX, 2XX, etc.) in the measured period. The period is either 10 or 20 seconds; I don't have any idea why that isn't consistent. When gmond receives the HTTP data in sFlow format, it computes the difference between the most recently reported value and the one before, and divides that by the reported interval. That is, it is expecting a running total, and that is NOT what is received. I don't know which software is right, but the NGINX reports are not what the gmond handler expects. All the other sFlow reports appear to be correct. -- Mark

sFlow plug-in: I am still trying to find out; it is actually built by another group and I'm not sure what they pulled, but I'm pretty sure it's 0.9.8
hsflowd version 1.23.2
gmond 3.6.0

On Tuesday, 4 March 2014, Silver, Jonathan jonathan.sil...@unify.com wrote:

We're using NGINX and sFlow to capture and send the metrics to Ganglia. The metric values look correct when viewed using sflowtool, but gmond (on the same box) is reporting them with all kinds of random values.
Running gmond --debug=10 I do see various error messages in the log.

Some of these:
sequence number error - 10.235.240.31:443-3:443 lostSamples=37

Some of these:
ERROR: [Errno 111] Connection refused

And some with the hostname NULL (but only one time for each metric):
***Allocating value packet for host--(null)-- and metric --http_meth_put--

Has anyone heard of this issue? I've started adding debug statements to gmond, but before I go through all of that I wanted to ask in case it's a known issue.

Thanks for any info, Jon
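Mark's description of gmond's handling (the difference between successive reported values, divided by the interval) can be sketched in a few lines. `counter_rate` below is a hypothetical helper, not gmond's actual code; it shows why feeding per-interval counts into delta arithmetic that expects a running total produces wild values.

```python
def counter_rate(prev_total, curr_total, interval_seconds, max_counter=2**32):
    """Per-second rate from two samples of a running (cumulative) counter.

    A negative delta is interpreted as a single 32-bit counter wrap,
    which is the usual convention for cumulative network-style counters.
    """
    delta = curr_total - prev_total
    if delta < 0:  # assume the counter wrapped around
        delta += max_counter
    return delta / interval_seconds

# A true running counter: 100 total requests, then 150 total 20s later -> 2.5 req/s.
print(counter_rate(100, 150, 20))  # 2.5

# If the agent instead sends per-interval counts (50 in one period, 30 in the
# next), the second sample looks like a wrapped counter and the "rate" explodes:
print(counter_rate(50, 30, 20))  # a huge, meaningless value
```

This matches the symptom in the thread: per-interval counts from nginx-sflow-module would appear as random spikes once gmond applies counter-delta arithmetic.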
Re: [Ganglia-general] sflow - getting VirStorageLookupByPAth failed.
Ron,

You might try downloading the latest source code for hsflowd and compiling with LIBVIRT=yes VRTDSKPATH=yes. In other words:

svn checkout http://svn.code.sf.net/p/host-sflow/code/trunk host-sflow-code
cd host-sflow-code
make LIBVIRT=yes VRTDSKPATH=yes

This turns on a different way of accessing the storage info. For details, see here: http://sourceforge.net/p/host-sflow/code/398/tree/trunk/src/Linux/hsflowd.c around line 634.

Please let me know if this works better.

Neil

On Mar 12, 2013, at 3:57 PM, Ron wrote:

First attempt to configure sFlow. I have numerous KVM/QEMU VMs running. My understanding was to use sFlow to collect metrics for these. I configured a gmond as an sFlow 'collector' and altered DNS to point to the collector. But now for each VM I get something like:

hsflowd: virStorageLookupByPath(/panfs/pan5/data/VMStorage/PiraatTriple.img) failed

But the file is there:

ls -l /panfs/pan5/data/VMStorage/PiraatTriple.img
-rw-rw 1 qemu qemu 17179869184 Mar 12 10:38 /panfs/pan5/data/VMStorage/PiraatTriple.img

Anybody bumped into this before?

TIA, Ron Reeder
Re: [Ganglia-general] calculate cpu utilization with cpu time
The sFlow CPU metrics are processed here: https://github.com/ganglia/monitor-core/blob/master/gmond/sflow.c#L334

Let me know if you find a problem.

Regards, Neil

On Aug 10, 2012, at 2:00 AM, crayon z wrote:

Hi, all:

I use Ganglia to parse metrics from Host sFlow. The CPU metrics in Host sFlow are in the form of CPU time; however, I want to know how Ganglia calculates CPU utilization from CPU time.

Best Regards -- Crayon Z
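The sflow.c code linked above is the authoritative answer; the usual scheme for turning cumulative CPU-time counters into utilization percentages is delta-based, and can be sketched as follows. The state names below mirror the common user/system/idle split and are an assumption for illustration, not the exact field set in the Host sFlow structures.

```python
def cpu_percentages(prev, curr):
    """Per-state CPU utilization (%) between two samples of cumulative
    CPU-time counters.

    prev/curr: dicts like {'user': ..., 'system': ..., 'idle': ...} holding
    cumulative CPU time (jiffies or milliseconds). Only the deltas matter,
    so the time unit cancels out of the calculation.
    """
    total_delta = sum(curr.values()) - sum(prev.values())
    if total_delta <= 0:
        return {k: 0.0 for k in curr}  # no elapsed CPU time between samples
    return {k: 100.0 * (curr[k] - prev[k]) / total_delta for k in curr}

prev = {'user': 1000, 'system': 500, 'idle': 8500}
curr = {'user': 1600, 'system': 700, 'idle': 9700}
pct = cpu_percentages(prev, curr)
# deltas: user 600, system 200, idle 1200; total 2000
print(pct['user'], pct['system'], pct['idle'])  # 30.0 10.0 60.0
```

The key point for the original question: a single sample of CPU time cannot give a utilization figure; two samples and their difference are required.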
Re: [Ganglia-general] Impact of gmond polling on data collection
In gmond.c:process_tcp_accept_channel(), could those goto statements close the socket and return without relinquishing the mutex?

Neil

On Sep 19, 2012, at 8:45 AM, Nicholas Satterly wrote:

Hi Peter,

Thanks for the feedback. I've added a thread mutex to the hosts hash table as you suggested and will send a pull request in the next day or so.

Regards, Nick

On Mon, Sep 17, 2012 at 8:25 PM, Peter Phaal peter.ph...@gmail.com wrote:

Nicholas,

It makes sense to multi-thread gmond, but looking at your patch, I don't see any locking associated with the hosts hashtable. Isn't there a possible race if new hosts/metrics are added to the hashtable by the UDP thread at the same time the hashtable is being walked by the TCP thread?

Peter

On Mon, Sep 17, 2012 at 6:03 AM, Nicholas Satterly nfsatte...@gmail.com wrote:

Hi Chris,

I've discovered there are two contributing factors to problems like this:

1. The number of metrics being sent (possibly in short bursts) can overflow the UDP receive buffer.
2. The time it takes to process metrics in the UDP receive buffer causes TCP connections from the gmetads to time out (currently hard-coded to 10 seconds).

In your case, you are probably dropping UDP packets because gmond can't keep up. Gmond was enhanced to allow you to increase the UDP buffer size back in April. I suggest you upgrade to the latest version and set this to a sensible value for your environment:

udp_recv_channel {
  port = 1234
  buffer = 1024000
}

Determining what is sensible is a bit of trial and error. Run netstat -su and keep increasing the value until you no longer see the number of packet receive errors going up.

$ netstat -su
Udp:
    7941393 packets received
    23 packets to unknown port received.
    0 packet receive errors
    10079118 packets sent

The other possibility is that it takes so long for a gmetad to pull back all the metrics you are collecting for a cluster that you are preventing the gmond from processing metric data received via UDP.
Again, this can cause the UDP receive buffer to overflow.

The problem we had at my work is related to all of the above but manifested itself in a slightly different way. We were seeing gaps in all our graphs because at times none of the servers in a cluster would respond to a gmetad poll within 10 seconds. I used to think that the gmonds were completely hung, but realised that they would respond normally most of the time; every minute or so, though, a response would take about 20-25 seconds. This happened to coincide with the UDP receive queue growing (Recv-Q column below), and I realised that it took this long for the gmond to process the metric data it had received via UDP from all the other servers in the cluster.

$ netstat -ua
Active Internet connections (servers and established)
Proto   Recv-Q  Send-Q  Local Address
udp    1920032       0  *:8649         *:*

The solution was to modify gmond and move the TCP request handler into a separate thread, so that gmond could take as long as it needed to process incoming metric data (from a UDP receive buffer that is large enough not to overflow) without blocking on the TCP requests for the XML data. The patched gmond is running without a problem in our environment, so I have submitted a pull request[1] for it to be included in trunk.

I can't be 100% sure that this patch will fix your problem, but it would be worth a try.

Regards, Nick

[1] https://github.com/ganglia/monitor-core/pull/50

On Sat, Sep 15, 2012 at 12:16 AM, Chris Burroughs chris.burrou...@gmail.com wrote:

We use Ganglia to monitor 500 hosts in multiple datacenters, with about 90k unique host:metric pairs per DC. We use this data for all of the cool graphs in the web UI and for passive alerting. One of our checks is to measure TN of load_one on every box (we want to make sure gmond is working and correctly updating metrics; otherwise we could be blind and not know it). We consider it a failure if TN is greater than 600. This is an arbitrary number, but 10 minutes seemed plenty long.
Unfortunately we are seeing this check fail far too often. We set up two parallel gmetad instances (monitoring identical gmonds) per DC and have broken our problem into two classes:

* (A) Only one of the gmetads stops updating for an entire cluster, and must be restarted to recover. Since the gmetads disagree, we know the problem is there. [1]
* (B) Both gmetads say an individual host has not reported (gmond aggregation or sending must be at fault). This issue is usually transient (that is, it recovers after some period of time greater than 10 minutes).

While attempting to reproduce (A) we ran several additional gmetad instances (again polling the same gmonds) around 2012-12-07. Failures per day are below [2]. The act of testing seems to have significantly increased the number of
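Nick's tuning loop earlier in the thread (raise `buffer` until `netstat -su` stops reporting drops) is easy to automate with a small parser. The sketch below matches the field label shown in the sample output above; treat that label as an assumption, since net-tools output can vary between versions and locales.

```python
import re
import subprocess

def udp_receive_errors(netstat_su_output):
    """Extract the UDP 'packet receive errors' count from `netstat -su` text.

    Returns the integer count, or None if the field is not present.
    """
    m = re.search(r"(\d+)\s+packet receive errors", netstat_su_output)
    return int(m.group(1)) if m else None

sample = """Udp:
    7941393 packets received
    23 packets to unknown port received.
    0 packet receive errors
    10079118 packets sent
"""
print(udp_receive_errors(sample))  # 0

# On a live host you could poll the real counter, e.g.:
#   out = subprocess.run(["netstat", "-su"], capture_output=True, text=True).stdout
#   print(udp_receive_errors(out))
```

Polling this value every few minutes and alerting when it increases gives an objective signal that the `udp_recv_channel` buffer is still too small.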
Re: [Ganglia-general] Gmond Compilation on Cygwin
You could try adding --disable-sflow as another configure option. (Or were you planning to use sFlow agents such as hsflowd?)

Neil

On Jul 9, 2012, at 3:50 AM, Nigel LEACH wrote:

Ganglia 3.4.0
Windows 2008 R2 Enterprise
Cygwin 1.5.25
IBM iDataPlex dx360 with Tesla M2070
Confuse 2.7

I'm trying to use the Ganglia Python modules to monitor a Windows-based GPU cluster, but I am having problems getting gmond to compile. This 'configure' completes successfully:

./configure --with-libconfuse=/usr/local --without-libpcre --enable-static-build

but 'make' fails. This is the tail of the standard output:

mv -f .deps/g25_config.Tpo .deps/g25_config.Po
gcc -std=gnu99 -DHAVE_CONFIG_H -I. -I.. -DCYGWIN -I/usr/include/apr-1 -I../lib -I../include/ -I../libmetrics -D_LARGEFILE64_SOURCE -DSFLOW -g -O2 -I/usr/local/include -fno-strict-aliasing -Wall -MT core_metrics.o -MD -MP -MF .deps/core_metrics.Tpo -c -o core_metrics.o core_metrics.c
mv -f .deps/core_metrics.Tpo .deps/core_metrics.Po
gcc -std=gnu99 -DHAVE_CONFIG_H -I. -I.. -DCYGWIN -I/usr/include/apr-1 -I../lib -I../include/ -I../libmetrics -D_LARGEFILE64_SOURCE -DSFLOW -g -O2 -I/usr/local/include -fno-strict-aliasing -Wall -MT sflow.o -MD -MP -MF .deps/sflow.Tpo -c -o sflow.o sflow.c
sflow.c: In function `process_struct_JVM':
sflow.c:1033: warning: comparison is always true due to limited range of data type
sflow.c:1034: warning: comparison is always true due to limited range of data type
sflow.c:1035: warning: comparison is always true due to limited range of data type
sflow.c:1036: warning: comparison is always true due to limited range of data type
sflow.c:1037: warning: comparison is always true due to limited range of data type
sflow.c:1038: warning: comparison is always true due to limited range of data type
sflow.c:1039: warning: comparison is always true due to limited range of data type
sflow.c: In function `processCounterSample':
sflow.c:1169: warning: unsigned int format, uint32_t arg (arg 4)
sflow.c:1169: warning: unsigned int format, uint32_t arg (arg 4)
sflow.c: In function `process_sflow_datagram':
sflow.c:1348: error: `AF_INET6' undeclared (first use in this function)
sflow.c:1348: error: (Each undeclared identifier is reported only once
sflow.c:1348: error: for each function it appears in.)
make[3]: *** [sflow.o] Error 1
make[3]: Leaving directory `/var/tmp/ganglia-3.4.0/gmond'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/var/tmp/ganglia-3.4.0/gmond'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/var/tmp/ganglia-3.4.0'
make: *** [all] Error 2

Has anyone come across this before?

Many Thanks, Nigel

___
This e-mail may contain confidential and/or privileged information. If you are not the intended recipient (or have received this e-mail in error) please notify the sender immediately and delete this e-mail. Any unauthorised copying, disclosure or distribution of the material in this e-mail is prohibited.
Please refer to http://www.bnpparibas.co.uk/en/email-disclaimer/ for additional disclosures.
[Ganglia-general] hsflowd ported to Solaris
Hello All,

There is now a Solaris port of hsflowd: http://host-sflow.sourceforge.net

Binary packages for sparc and x86 can be downloaded, but sources are only in the trunk:

mkdir host-sflow-trunk
svn co https://host-sflow.svn.sourceforge.net/svnroot/host-sflow/trunk host-sflow-trunk
more host-sflow-trunk/INSTALL.SunOS

Some Ganglia+sFlow explanation here: http://blog.sflow.com/2011/07/ganglia-32-released.html

Thanks go to Johnny Johnson for contributing the port. If you run Solaris, your feedback would be very much appreciated.

Neil
Re: [Ganglia-general] udp_recv_channel for sflow and gmetric
I'm pretty sure this will not work. You need separate ports.

Neil Mckee

On Mar 28, 2012, at 2:45 PM, Ozzie Sabina o...@sabina.org wrote:

Can this be shared? A quick googling failed me here. Can I configure a single one of these and accept messages from both gmetric and sFlow clients? We use a port per service and run multiple gmonds per machine, so it's considerably simpler to only use the one port, as we have the infrastructure in place for that. To be explicit, if I do:

globals {
  mute = no
  deaf = no
  ...
}

udp_recv_channel {
  port = 15010
}

sflow {
  udp_port = 15010
}

(a) is that sufficient alone (with gmond 3.3.1) to start collecting sFlow metrics being spit at me, and (b) will I be able to also send gmetric values to the same port?

Oz
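The separate-port arrangement Neil recommends would look like the fragment below; 15011 is an invented port number for illustration, and the hsflowd senders would need to be pointed at it instead of 15010.

```
globals {
  mute = no
  deaf = no
  ...
}

udp_recv_channel {
  port = 15010        /* gmetric / native gmond protocol */
}

sflow {
  udp_port = 15011    /* sFlow datagrams on their own port */
}
```

The two protocols have different wire formats, so giving each its own UDP port lets gmond dispatch incoming packets to the right parser unambiguously.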
Re: [Ganglia-general] Free Velocity Online conference tomorrow
I guess the sFlow network interface counters could go in as well, but with the current model I think we would have to flatten the data so that every interface looked like a separate host in the Ganglia database. Is that really what you want? It seems like this is part of the discussion about naming, tagging and parent-child hierarchy that came up with the introduction of hypervisors and their VMs. Any more thoughts on that?

Neil

On Oct 25, 2011, at 12:36 PM, Vladimir Vuksan wrote:

Great. I am often asked about network devices such as switches and routers. What is the roadmap on that?

Thanks, Vladimir

On Tue, 25 Oct 2011, Neil Mckee wrote:

Vladimir,

Just an FYI, since it seems to be relevant to your talk: I am preparing a patch for Ganglia that will add support for the sFlow-HTTP feed, as exported by mod-sflow, nginx-sflow-module, tomcat-sflow-valve and node-sflow-module. This represents an efficient way to get real-time HTTP stats from a large web farm. The sFlow-HTTP spec should be finalized in the next few weeks, so the patch can go in soon after that.

Background info here: http://blog.sflow.com/search?q=HTTP
Discussion on the sFlow-HTTP spec (please comment!): http://groups.google.com/group/sflow/browse_thread/thread/88accb2bad594d1d#
Source code links: http://host-sflow.sourceforge.net/relatedlinks.php

Regards, Neil

P.S. sFlow-MEMCACHE support will probably be added to Ganglia at the same time.

On Oct 25, 2011, at 8:47 AM, Vladimir Vuksan wrote:

I was gonna mention there is a free Velocity online conference/webcast. I will be speaking about backend monitoring and, time permitting, will be demoing some of the Ganglia Web 2.0 features. http://velocityconf.com/velocity-oct2011

Vladimir
Re: [Ganglia-general] vlan traffic counting
Thanks for bringing this up. I checked a change into the hsflowd trunk that looks for these interfaces and excludes them from the counting. It uses the SIOCGIFVLAN ioctl call, although it seems that your filter on the device name might work just fine.
http://host-sflow.svn.sourceforge.net/viewvc/host-sflow/trunk/src/Linux/readInterfaces.c?annotate=231

Neil

On Aug 29, 2011, at 6:39 AM, Robin Humble wrote:

I've noticed on our VLAN interfaces that gmond's default network metrics seem to be miscalculating network traffic. Anyone else seeing this? We don't use many VLANs. It seems the Linux OS counters for the VLAN interface are also added onto the parent interface, and gmond reads both, so traffic reported by gmond is ~2x greater than it really is.

E.g. eth4 (no IP set) with an eth4.99 VLAN; /proc/net/dev shows:

Inter-|   Receive                                              | Transmit
 face |bytes packets errs drop fifo frame compressed multicast|bytes packets errs drop fifo colls carrier compressed
...
eth4: 1453106293850 1688242724000 0 0 11874 3090232715347 2518006182000 0 0 0
eth4.99: 1429470895714 1688242724000 0 0 11874 2988353281655 912706240000 0 0 0

Maybe we have set up our interfaces oddly or something. I don't know why Tx Pkts is different between the 2 interfaces ... maybe an upstream MTU. Aliased interfaces don't have the same problem, as Linux doesn't list them in /proc/net/dev. Our setup is ganglia 3.2.0, x86_64, CentOS 5.6 userland, 2.6.32 vanilla kernels, ixgbe 10GigE.

The patch below fixes/hacks around the problem by simply skipping all VLAN interfaces, i.e. anything with a '.' in the name. Doesn't seem right somehow, but seems to work for me.
cheers,
robin
--
Dr Robin Humble, HPC Systems Analyst, NCI National Facility

--- ganglia-3.2.0.orig/libmetrics/linux/metrics.c	2010-05-11 00:39:54.0 +1000
+++ ganglia-3.2.0/libmetrics/linux/metrics.c	2011-08-29 16:19:55.0 +1000
@@ -181,8 +181,10 @@ void update_ifdata ( char *caller )
       p = index(p, ':');
 
       /* Ignore 'lo' and 'bond*' interfaces (but sanely) */
+      /* Ignore VLAN interfaces (eg. eth4.99) as stats are already included in parent */
       if (p && strncmp (src, "lo", 2) &&
-          strncmp (src, "bond", 4))
+          strncmp (src, "bond", 4) &&
+          (index(src, '.') == NULL || index(src, '.') > p))
         {
           p++;
           /* Check for data from the last read for this */
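The device-name test in the patch above can be expressed as a standalone helper. This is a hypothetical sketch for illustration, not the actual libmetrics code: it flags names like eth4.99 where a '.' appears before the ':' delimiter of a /proc/net/dev line.

```c
#include <string.h>

/* Hypothetical helper mirroring the patch: given a pointer to the
 * interface-name field of a /proc/net/dev line ("src") and a pointer
 * to the ':' that terminates the name ("colon"), report whether the
 * name is a VLAN sub-interface (eg. "eth4.99") whose traffic is
 * already included in the parent device's counters. */
static int is_vlan_ifname(const char *src, const char *colon)
{
    const char *dot = strchr(src, '.');
    return dot != NULL && dot < colon;
}
```

update_ifdata() would then skip the line whenever this returns non-zero, which is equivalent to the extra `index()` condition in the patch.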
Re: [Ganglia-general] Problem displaying Virtual Machine data with hsflowd and ganglia 3.2.0 in an Openstack Compute node.
On Aug 31, 2011, at 8:15 AM, Emanuele Verga wrote: Hi Neil, thanks a lot for the help! I verified the libvirt version, it's 0.8.8. I've downloaded and compiled hsflowd revision 227: now ganglia correctly receives and processes the VM Bytes Written and VM Writes statistics. (http://imageshack.us/f/846/vmstats.png/) The other disk statistics for VMs (VM Bytes Read, VM Disk Errors, Free Vdisk Space, VM Reads, Total Vdisk Space) are not displayed, and hsflowd -dd displays errors similar to the following:

libvir: QEMU error : invalid argument in invalid path vda not assigned to domain
virDomainGetBlockInfo(vda) failed

Oh, sorry. virDomainGetBlockInfo() needs the path, not the deviceName. I've made the (1-line) change and checked it in. Please svn update and try again. If this works you should start to see the capacity, allocation and available fields. I'm not sure whether disk errors are going to show up or not. It depends on how libvirt implements the virDomainBlockStats() call for KVM. By the way, if you don't want it to even attempt that other call that always fails to find the volume, you can add this somewhere in src/Linux/Makefile:

CFLAGS += -DHSP_VRT_USE_DISKPATH

But I think we might change it so that the error message only appears once in future (don't want to fill the logs), so that might be just as good. Neil I've logged a few minutes of hsflowd activity; if it can help you, you can download it here: http://www.mediafire.com/?g4jac7dm3mmb662 2011/8/29 Neil Mckee neil.mckee...@gmail.com Sorry, the failure of virStorageLookupByPath() was preventing virDomainBlockStats() from being attempted. I checked in a fix for this, and also code to try the newer virDomainGetBlockInfo() call as a fallback should virStorageLookupByPath() fail. This call only came in with libvirt version 0.8.1. Are you running something newer than that? (see /usr/include/libvirt/libvirt.h) If this works, we should make a new release of hsflowd, so please let me know how it goes.
Regards, Neil On Aug 29, 2011, at 7:24 AM, Emanuele Verga wrote: Hi, I downloaded and installed hsflowd trunk revision 226, but using hsflowd I keep seeing virStorageLookupByPath errors, and VM disk statistics aren't displayed. Do I need to tell hsflowd explicitly to use the target=vda call? If yes, how? Thanks in advance, Emanuele 2011/8/25 Emanuele Verga verga.emanu...@gmail.com Hi Neil, Yes that's possible; the problem is Nova places each image in a separate folder (/var/lib/nova/instance/INSTANCENAME/), so we would have to create a new pool with the corresponding path each time a new instance is created, and if we start to add more servers it quickly becomes impractical. I've not yet been able to try the hsflowd version you suggested; I'll test it tomorrow and let you know. Thanks for the help! Emanuele
Re: [Ganglia-general] Problem displaying Virtual Machine data with hsflowd and ganglia 3.2.0 in an Openstack Compute node.
Sorry, the failure of virStorageLookupByPath() was preventing virDomainBlockStats() from being attempted. I checked in a fix for this, and also code to try the newer virDomainGetBlockInfo() call as a fallback should virStorageLookupByPath() fail. This call only came in with libvirt version 0.8.1. Are you running something newer than that? (see /usr/include/libvirt/libvirt.h) If this works, we should make a new release of hsflowd, so please let me know how it goes. Regards, Neil On Aug 29, 2011, at 7:24 AM, Emanuele Verga wrote: Hi, I downloaded and installed hsflowd trunk revision 226, but using hsflowd I keep seeing virStorageLookupByPath errors, and VM disk statistics aren't displayed. Do I need to tell hsflowd explicitly to use the target=vda call? If yes, how? Thanks in advance, Emanuele 2011/8/25 Emanuele Verga verga.emanu...@gmail.com Hi Neil, Yes that's possible; the problem is Nova places each image in a separate folder (/var/lib/nova/instance/INSTANCENAME/), so we would have to create a new pool with the corresponding path each time a new instance is created, and if we start to add more servers it quickly becomes impractical. I've not yet been able to try the hsflowd version you suggested; I'll test it tomorrow and let you know. Thanks for the help! Emanuele
Re: [Ganglia-general] Problem displaying Virtual Machine data with hsflowd and ganglia 3.2.0 in an Openstack Compute node.
On Aug 18, 2011, at 1:35 AM, Emanuele Verga wrote: Ok, I tried linking one of the disk files to the default storage pool folder and it actually detected the linked volume in libvirt: after issuing a virsh pool-refresh default the disk was correctly detected and reported as a volume by virsh vol-list default, but there is a problem: the disk is added as a volume in libvirt with the path parameter corresponding to the soft link (for example: /var/lib/libvirt/images/instance.002d_disk) instead of the disk path (/var/lib/nova/instances/instance-002d/disk). This means that disk lookups performed by hsflowd still fail, because they try to retrieve volume information associated with the disk path (example: virStorageLookupByPath(/var/lib/nova/instances/instance-002d/disk)). Hsflowd gets path information from the Virtual Machine XML definition (virsh dumpxml instance.002d shows, among many other details, the following line: <source file='/var/lib/nova/instances/instance-002d/disk'/>), which is generated and stored automatically by OpenStack in libvirt.xml (ex: /var/lib/nova/instances/instance-002d/libvirt.xml). So, to make it work this way, we would need a way to tell Nova/OpenStack which path to look into to retrieve VM disks, and to create the related soft links in the appropriate folder when provisioning a new instance. I also tried using virt-manager to create the pool but it didn't work: the pool was created but no disk was detected, I suppose because libvirt expects volumes to be located right in the folder specified as the pool, and doesn't look in any subfolder (creating the pool manually didn't work for that reason). So, was it not possible to specify the pool directory to be the one where the disk image was actually residing? That worked for me when I tried it here. I have a disk image: /root/not_libvirt_images/test-pool2.img and I was able to add it to a new storage pool called alternative using virt-manager.
So in virsh I can ask for pool-dumpxml alternative, like this:

virsh # pool-dumpxml alternative
<pool type='dir'>
  <name>alternative</name>
  <uuid>51341d87-d87e-f7ce-6cc0-81ac3967c182</uuid>
  <capacity>233620566016</capacity>
  <allocation>37986648064</allocation>
  <available>195633917952</available>
  <source>
  </source>
  <target>
    <path>/root/not_libvirt_images</path>
    <permissions>
      <mode>0700</mode>
      <owner>0</owner>
      <group>0</group>
    </permissions>
  </target>
</pool>

Something I noticed that may be important is that virt-manager is able to display disk statistics for those VMs. I don't really know how it gets that information, but I believe it accesses disks using the details contained in the <disk>...</disk> tags of the VM XML definition, instead of doing a volume lookup starting from the volume path, as hsflowd is trying to do. Example:

<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2'/>
  <source file='/var/lib/nova/instances/INSTANCENAME/disk'/>
  <target dev='vda' bus='virtio'/>
  <alias name='virtio-disk0'/>
  <address type='pci' domain='0x' bus='0x00' slot='0x04' function='0x0'/>
</disk>

Could this be used in some way? There is a libvirt call that takes the target=vda device name and returns reads/writes counters. This was just added to hsflowd. However it's not released yet, so you'd have to check out the trunk using subversion and build from that:

svn co https://host-sflow.svn.sourceforge.net/svnroot/host-sflow/trunk host-sflow-trunk

I'm not sure if the same handle can be used to retrieve the capacity, allocation and available numbers, though. (If there is a libvirt expert on the list, please jump in and set us straight.) Neil
Re: [Ganglia-general] Problem displaying Virtual Machine data with hsflowd and ganglia 3.2.0 in an Openstack Compute node.
I don't think the soft-links will work, but try this: 1) go back to compiling hsflowd for libvirt. 2) using virt-manager or equivalent, tell it about the storage pool of type filesystem directory at /var/lib/nova/instances. See details here: http://virt-manager.org/page/StorageManagement (To get to this virt-manager screen you need to select Edit > Host Details.) Now you should see it with virsh pool-list. If you called it nova then virsh pool-info nova will show something like this:

virsh # pool-info nova
Name:           nova
UUID:           51341d87-d87e-f7ce-6cc0-81ac3967c182
State:          running
Capacity:       217.58 GB
Allocation:     35.38 GB
Available:      182.20 GB

hsflowd should then be able to pull those Capacity, Allocation and Available values and send them to Ganglia. Neil P.S. It looks as though hsflowd is not filling in the disk reads/writes/errors counters for KVM VMs yet. If you know of an efficient way to do that, please suggest it on the hsflowd mailing list: https://lists.sourceforge.net/lists/listinfo/host-sflow-discuss On Aug 17, 2011, at 7:16 AM, santosh gangwani wrote: try having soft links using the ln command, may or may not work though :) ln -s On Wed, Aug 17, 2011 at 4:17 PM, Emanuele Verga verga.emanu...@gmail.com wrote: Hi Neil, thanks for the suggestion. I tried to do it. After recompiling hsflowd I checked the files it had open and it showed:

hsflowd 30019 nobody mem REG 251,3 84728 927248 /usr/lib/libxenctrl.so.3.2.0
hsflowd 30019 nobody mem REG 251,3 22928 927250 /usr/lib/libxenstore.so.3.0.0

but this solution didn't work.
Virtual machines stopped showing up in ganglia and hsflowd gave this error:

ERROR Internal error: Could not obtain handle on privileged command interface (2 = No such file or directory) xc_interface_open() failed : No such file or directory

Then it continued to work normally (hsflowd log file: http://uploading.com/files/get/a3876c8b/) but it didn't report anything to ganglia and the VMs were shown as down; the reporting for the physical host, on the other hand, is working perfectly, as before. If it can help you: the implementation of OpenStack we have uses KVM to virtualize the hosts. Could it be related? Thanks again, Emanuele 2011/8/16 Neil Mckee neil.mckee...@gmail.com Hello, On an OpenStack node you may be able to use libxenstore instead of libvirt. You'll need to recompile hsflowd to try this. Looking at trunk/src/Linux/Makefile it appears to look for libvirt first, but you can override that by compiling hsflowd like this:

make clean
make LIBVIRT=no

The Makefile will then test for libxenstore and libxenctrl. If it finds them it will compile with -DHSF_XEN (instead of -DHSF_VRT), and you may get better results. Please let me know what happens. Neil On Aug 16, 2011, at 2:14 AM, Emanuele Verga wrote: Hi, we have a problem with the following installation: we have a system that’s a compute node in an OpenStack test installation. On this machine we decided to install Ganglia, to check its monitoring capabilities regarding the virtual machines hosted on that node by OpenStack. We then added the repository for ganglia version 3.2 and installed: hsflowd 1.18, Ganglia Monitor Daemon 3.2.0.0, Ganglia Meta Daemon 3.2.0.0, Ganglia Web Frontend 3.2.0.0, Dwoo 1.1.1, all on the same machine, and configured ports accordingly. All said and done, the web frontend showed the physical host and all of the VMs, but we were unable to: See the hypervisor section in the physical host statistics. It simply is not there.
See the graphical preview for VM statistics (http://imageshack.us/photo/my-images/585/screenshotgangliacomput.png/). The thumbnails are missing but the links do work; clicking on one of the missing thumbnails takes you to the details page for that VM. See details regarding VM hard disk and I/O. (http://imageshack.us/photo/my-images/694/screenshotgangliainstan.png/) Debugging hsflowd we found errors similar to the following:

Aug 16 06:26:57 eta hsflowd: virStorageLookupByPath(/var/lib/nova/instances/instance-004d/disk.local) failed
Aug 16 06:26:57 eta hsflowd: virStorageLookupByPath(/var/lib/nova/instances/instance-004d/disk) failed

The strange thing is, the path is correct. We checked libvirt using virsh and found that the Storage Volumes are not reported by libvirt, it seems because libvirt by default searches for information in the path /var/lib/libvirt/images, whereas nova places it inside /var/lib/nova/instances/INSTANCENAME/. Did you have the same problems when testing sFlow with OpenStack? How did you manage to resolve it? Any help is appreciated. Thanks in advance for your support
Re: [Ganglia-general] Problem displaying Virtual Machine data with hsflowd and ganglia 3.2.0 in an Openstack Compute node.
Hello, On an OpenStack node you may be able to use libxenstore instead of libvirt. You'll need to recompile hsflowd to try this. Looking at trunk/src/Linux/Makefile it appears to look for libvirt first, but you can override that by compiling hsflowd like this:

make clean
make LIBVIRT=no

The Makefile will then test for libxenstore and libxenctrl. If it finds them it will compile with -DHSF_XEN (instead of -DHSF_VRT), and you may get better results. Please let me know what happens. Neil On Aug 16, 2011, at 2:14 AM, Emanuele Verga wrote: Hi, we have a problem with the following installation: we have a system that’s a compute node in an OpenStack test installation. On this machine we decided to install Ganglia, to check its monitoring capabilities regarding the virtual machines hosted on that node by OpenStack. We then added the repository for ganglia version 3.2 and installed: hsflowd 1.18, Ganglia Monitor Daemon 3.2.0.0, Ganglia Meta Daemon 3.2.0.0, Ganglia Web Frontend 3.2.0.0, Dwoo 1.1.1, all on the same machine, and configured ports accordingly. All said and done, the web frontend showed the physical host and all of the VMs, but we were unable to: See the hypervisor section in the physical host statistics. It simply is not there. See the graphical preview for VM statistics (http://imageshack.us/photo/my-images/585/screenshotgangliacomput.png/). The thumbnails are missing but the links do work; clicking on one of the missing thumbnails takes you to the details page for that VM. See details regarding VM hard disk and I/O. (http://imageshack.us/photo/my-images/694/screenshotgangliainstan.png/) Debugging hsflowd we found errors similar to the following:

Aug 16 06:26:57 eta hsflowd: virStorageLookupByPath(/var/lib/nova/instances/instance-004d/disk.local) failed
Aug 16 06:26:57 eta hsflowd: virStorageLookupByPath(/var/lib/nova/instances/instance-004d/disk) failed

The strange thing is, the path is correct.
We checked libvirt using virsh and found that the Storage Volumes are not reported by libvirt, it seems because libvirt by default searches for information in the path /var/lib/libvirt/images, whereas nova places it inside /var/lib/nova/instances/INSTANCENAME/. Did you have the same problems when testing sFlow with OpenStack? How did you manage to resolve it? Any help is appreciated. Thanks in advance for your support!
Re: [Ganglia-general] missing many samples with host-sflow...
500 nodes sending sFlow-HOST data is probably only about 25 packets/sec, so the issue here is unlikely to be a performance bottleneck in terms of CPU, network bandwidth, UDP buffers etc. Right now the most likely explanation seems to be some race-condition over how long before gmond considers the data to be stale. In the function process_sflow_gmetric() in sflow.c we have this:

gfull->metric.tmax = 60; /* (secs) poll if it changes faster than this */
gfull->metric.dmax = 0;  /* (secs) how long before stale? */

I was under the impression that setting dmax to 0 is supposed to mean that the data does not expire at all, but maybe this assumption is wrong? Please confirm that you are running hsflowd with a polling-interval set to 30 seconds or less, and please confirm that the CPU is not busy. The other step we could take is to log the values of lostDatagrams and lostSamples when the debug level is set on the command line (these counters are maintained within sflow.c but not logged at the moment). That would help to confirm or deny whether there is any bottleneck in the front end. The gmond process blocks while the XML data is being extracted. So if you were extracting the XML data over a slow link to a slow device and it took a number of seconds to transfer, then you might conceivably lose packets due to the UDP input buffer overflowing during that time. If that is happening it will show up in the lostDatagrams counter. The workaround might just be to ioctl() the input socket buffer to a bigger size. I've seen this bumped up from about 130K to over 2MB before, so that would buy more time without having to do anything more elaborate. Regards, Neil On Jul 21, 2011, at 12:32 PM, Robert Jordan wrote: I have a cluster with approximately 500 nodes reporting via host-sflow to a single gmond. In the past few days my graphs have started to look like dotted lines and most of the time ganglia reports all of the nodes as down. Has anyone seen similar issues?
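[Editor's sketch of the buffer-resize workaround mentioned above. On Linux this is normally done with setsockopt() rather than ioctl(); the 2MB figure is the one suggested in the message, and the kernel may silently clamp the request to net.core.rmem_max.]

```c
#include <sys/socket.h>

/* Sketch: ask the kernel for a larger receive buffer on the sFlow UDP
 * socket, so queued datagrams survive a multi-second stall (eg. while
 * gmond blocks serializing its XML output). Returns 0 on success,
 * -1 on error. The kernel may cap the effective size. */
static int grow_rcvbuf(int sock)
{
    int bytes = 2 * 1024 * 1024; /* ~2MB, per the suggestion above */
    return setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes));
}
```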
Re: [Ganglia-general] missing many samples with host-sflow...
Upon investigation we found that a handful of the nodes were sending with sFlow-agent-address == 0.0.0.0. These nodes boot using DHCP, so this may be a race where the hsflowd daemon starts before the IP address has been learned. The fix will be to make hsflowd wait until it has a current IP address before sending (and check for changes periodically). And at the gmond end, we should probably add a check to ignore any datagrams that have sFlow-agent-address == 0.0.0.0. Because multiple nodes were sending with the same agent address, the effect was to alias their data together so that it looked like successive readings from the same node. Most of the time the resulting sequence-number deltas were such that the data was being ignored anyway, but as clocks drift over time it's possible that some readings would get through and result in astronomically high deltas being recorded. If that happened and these large deltas were enough to trip a sanity-check somewhere further on (perhaps in gmetad), then that could explain how the gaps appeared in the chart for the whole cluster. Neil On Jul 22, 2011, at 1:06 PM, Neil Mckee wrote: 500 nodes sending sFlow-HOST data is probably only about 25 packets/sec, so the issue here is unlikely to be a performance bottleneck in terms of CPU, network bandwidth, UDP buffers etc. Right now the most likely explanation seems to be some race-condition over how long before gmond considers the data to be stale. In the function process_sflow_gmetric() in sflow.c we have this:

gfull->metric.tmax = 60; /* (secs) poll if it changes faster than this */
gfull->metric.dmax = 0;  /* (secs) how long before stale? */

I was under the impression that setting dmax to 0 is supposed to mean that the data does not expire at all, but maybe this assumption is wrong? Please confirm that you are running hsflowd with a polling-interval set to 30 seconds or less, and please confirm that the CPU is not busy.
The other step we could take is to log the values of lostDatagrams and lostSamples when the debug level is set on the command line (these counters are maintained within sflow.c but not logged at the moment). That would help to confirm or deny whether there is any bottleneck in the front end. The gmond process blocks while the XML data is being extracted. So if you were extracting the XML data over a slow link to a slow device and it took a number of seconds to transfer, then you might conceivably lose packets due to the UDP input buffer overflowing during that time. If that is happening it will show up in the lostDatagrams counter. The workaround might just be to ioctl() the input socket buffer to a bigger size. I've seen this bumped up from about 130K to over 2MB before, so that would buy more time without having to do anything more elaborate. Regards, Neil On Jul 21, 2011, at 12:32 PM, Robert Jordan wrote: I have a cluster with approximately 500 nodes reporting via host-sflow to a single gmond. In the past few days my graphs have started to look like dotted lines and most of the time ganglia reports all of the nodes as down. Has anyone seen similar issues?
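[Editor's sketch of the receiver-side guard proposed above — a hypothetical helper for illustration, not gmond's actual code.]

```c
#include <netinet/in.h>

/* Sketch: reject an sFlow agent address of 0.0.0.0, which several
 * DHCP-booted nodes may briefly share before learning their real
 * address; accepting it would alias their counter streams together
 * under one agent key. */
static int agent_address_usable(struct in_addr agent)
{
    return agent.s_addr != INADDR_ANY; /* INADDR_ANY == 0.0.0.0 */
}
```

A datagram failing this test would simply be dropped (and perhaps counted) before any sequence-number processing.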
Re: [Ganglia-general] Network bytes spikes
I checked the sFlow feed, and it looks like the sanity checks for 32-bit rollover and impossible-counter-delta are already present in the hsflowd code (host-sflow.sourceforge.net, src/Linux/readNioCounters.c) -- at least for the Linux and FreeBSD ports. We should add those checks to the Windows port. Always better to clean things up at the source if you can. That makes it less urgent to add the same sanity checks at the receiver end (monitor-core/gmond/sflow.c). Sanity checks in too many places could cause headaches down the line (e.g. when we all have 10Tbps links). I apologize if this is too much information about a feature that is only available if you compile the Ganglia trunk from sources, but for the record: (1) The 32-bit rollover problem is handled in hsflowd by polling faster internally (every 3 seconds). This accumulates 64-bit versions of the counters, which are then pushed out at the normal polling frequency (typically 20 seconds). If the code detects that the kernel counters are already 64-bit, then it turns off the 3-second polling. (2) The impossible-counter-delta sanity checks in hsflowd depend on whether the field is 32-bit or 64-bit. The upper limit for a 32-bit counter delta is 0x7FFFFFFF (about 2e9) and for a 64-bit counter it is 1e13. These checks are applied to the frames and bytes counters, but if either check fails then the sequence number is reset for the whole counter-block -- which invalidates all the counter-deltas for that polling-interval. In other words, if the bytes_in counter jumps crazily then we won't believe the frames, errors or drops counters either. Looking at libmetrics/linux/metrics.c, it does seem that compiling with -DREMOVE_BOGUS_SPIKES will do more or less the same as (2). Neil On Mar 30, 2011, at 5:56 PM, Bernard Li wrote: Hi all: On Tue, Mar 29, 2011 at 11:30 AM, Vladimir Vuksan vli...@veus.hr wrote: I see it all the time :-(. According to Bernard this is due to a problem with some of the Broadcom cards.
Perhaps Bernard can offer more insight. Some old threads which describe the issue in more detail: http://www.mail-archive.com/ganglia-general@lists.sourceforge.net/msg04463.html http://www.mail-archive.com/ganglia-general@lists.sourceforge.net/msg04245.html I see two solutions to this problem: 1) If this is indeed a driver issue, we should check to see if newer kernels can fix it. Perhaps Vladimir could look into this. 2) It would probably be a good thing to implement a sanity check. I think Neil is looking into implementing this for the sFlow integration. Perhaps this could be extended for gmond data as well. To help resolve this issue, I would suggest that we: 1) File a bug at bugzilla.ganglia.info 2) For all those affected, add comments to the bug providing the network driver model, module used, kernel version, OS version etc. Thanks! Bernard
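[Editor's note: point (1) in Neil's message relies on a property of unsigned arithmetic — as long as a 32-bit counter cannot wrap more than once between the fast 3-second polls, the modulo-2^32 difference new - old is the true delta even across a rollover. A sketch of the idea (mirroring, not copying, hsflowd's readNioCounters.c):]

```c
#include <stdint.h>

/* Sketch: fold a freshly-polled 32-bit counter reading into a running
 * 64-bit total. The unsigned subtraction yields the correct delta even
 * when the counter wrapped past 0xFFFFFFFF since the last poll, as
 * long as it wrapped at most once between polls. */
static uint64_t accumulate32(uint64_t total, uint32_t prev, uint32_t now)
{
    uint32_t delta = now - prev; /* modulo-2^32 difference */
    return total + delta;
}
```

At 10Gbps line rate a 32-bit bytes counter can wrap roughly every 3.4 seconds, which is why the internal poll has to be that fast.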
[Ganglia-general] Ganglia and sFlow at Supercomputing 2010 in New Orleans
Hello all, Exhibiting at the Supercomputing 2010 show in New Orleans? Setting up a demo cluster? We are running a monitoring server in the SCinet NOC which is configured to receive sFlow from the show network. Selected pages will be shown on big screens all around the show floor and linked from the conference website. The server is running the latest gmond+Ganglia, which accepts sFlow input. That means you can install the lightweight hsflowd daemon on your servers and have your cluster appear on the display too: http://host-sflow.sourceforge.net The server is already up and running: http://inmon.sc10.org I appreciate that getting a demo running on a trade-show booth can be tough enough, but I think you'll find this part is really easy to set up, so it won't hold you back or slow you down if you decide to give it a try. You might be able to justify it a) because it's good publicity, or b) because it's just fun(!) Either way, please contact me at neil.mc...@inmon.com or come and ask for me at the NOC when you get there. Neil
___
Ganglia-general mailing list
Ganglia-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-general