Re: [Ganglia-general] NGINX / SFLOW / Ganaglia - metrics get corrupted

2014-03-07 Thread Bernard Li
Can you connect to the gmond port and paste the XML for the metrics in
question?  I'd like to see how they're defined.
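
For anyone following along, one way to capture that XML is to read the dump gmond writes to any client that connects to its TCP accept channel. The sketch below assumes the stock default port 8649 and that the metrics of interest start with `http_` (as `http_meth_put` does); the function names are illustrative and not part of Ganglia.

```python
import socket
import xml.etree.ElementTree as ET


def fetch_gmond_xml(host="localhost", port=8649, timeout=5.0):
    """Read the full XML dump gmond sends on its TCP accept channel.

    Port 8649 is the stock default; adjust if gmond.conf says otherwise.
    """
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # gmond closes the connection after the dump
                break
            chunks.append(data)
    return b"".join(chunks)


def metric_definitions(xml_bytes, prefix="http_"):
    """Return the attributes of every METRIC whose NAME starts with prefix."""
    root = ET.fromstring(xml_bytes)
    return [
        dict(metric.attrib)
        for metric in root.iter("METRIC")
        if metric.get("NAME", "").startswith(prefix)
    ]
```

Calling `metric_definitions(fetch_gmond_xml())` on the gmond host returns one dict per matching metric; the `TYPE`, `UNITS` and `SLOPE` attributes are the parts that show how each metric is defined.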

Thanks,

Bernard

On Fri, Mar 7, 2014 at 11:08 AM, Flanagan, Mark mark.flana...@unify.com wrote:
 http://www.sflow.org/ appears to be the defining entity for sflow.
 http://www.sflow.org/sflow_http.txt would appear to define the http sflow 
 data.

 It is not explicitly clear what the counter values are supposed to mean.
 The general architecture of sflow-like data suggests the values should be
 running counters (like the network interface metrics), which would mean
 gmond is interpreting the packets properly and NGINX is sending the wrong
 data.

 That's just my guess for now.


 -Original Message-
 From: Bernard Li [mailto:bern...@vanhpc.org]
 Sent: Friday, March 07, 2014 1:39 PM
 To: Silver, Jonathan
 Cc: ganglia-general@lists.sourceforge.net; Flanagan, Mark
 Subject: Re: [Ganglia-general] NGINX / SFLOW / Ganaglia - metrics get 
 corrupted

 Hi Jonathan:

 Perhaps you can share how these metrics are defined?

 Cheers,

 Bernard

 On Fri, Mar 7, 2014 at 10:21 AM, Silver, Jonathan
 jonathan.sil...@unify.com wrote:
 Does the following analysis mean anything to anyone?
 It seems to me that this is a basic thing that should have been seen by 
 everyone else and found during first test - unless it's some config 
 parameter.

 Thanks
 Jon

 ---

 Well, I think I understand what is happening - but I don't even want to 
 think about fixing it. I'm not sure which software is right.

 The sflow data coming from NGINX reports the number of various HTTP messages 
 (GET, HEAD, 1XX, 2XX, etc) in the measured period.
 The period is either 10 or 20 seconds - I don't have any idea why that isn't 
 consistent.

 When gmond receives the HTTP data in sflow format, it computes the 
 difference between the most recently reported value and the one before and 
 divides that by the reported interval. That is, it is expecting a running 
 total and that is NOT what is received.

 I don't know which software is right, but the NGINX reports are not what the 
 gmond handler expects.
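
 A small numeric sketch of the mismatch described above. It assumes the receiver derives a rate as (current - previous) / interval, as described in this thread; the function name and sample numbers are made up for illustration, and this is not gmond's actual code.

```python
def rates_from_samples(samples, interval):
    """Treat samples as a running total and return per-second rates.

    Mirrors the behaviour described above: (current - previous) / interval.
    """
    return [(cur - prev) / interval for prev, cur in zip(samples, samples[1:])]


interval = 10  # seconds between reports (assumed constant here)

# Case 1: the sender reports a running total, which the receiver expects.
running_total = [100, 150, 210, 260]
print(rates_from_samples(running_total, interval))  # [5.0, 6.0, 5.0]

# Case 2: the sender reports the count for each period instead.
# These are the same 50, 60 and 50 requests as in case 1.
per_period = [50, 60, 50]
print(rates_from_samples(per_period, interval))  # [1.0, -1.0]
```

 With a running total the rates come out as 5, 6 and 5 requests per second. Feeding the per-period counts through the same arithmetic gives small and even negative values, which would look like random data on a graph.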

 All the other sflow reports appear to be correct.

 -- Mark


 Flow plug-in:  I am still trying to find out; it is actually built by
 another group and I'm not sure what they pulled, but I'm pretty sure
 it's 0.9.8

 hsflowd version 1.23.2

 gmond 3.6.0

  -
 On Tuesday, 4 March 2014, Silver, Jonathan jonathan.sil...@unify.com
 wrote:

 We're using NGINX and sflow to capture the metrics and send them to ganglia.
 The metric values look correct when viewed using sflowtool, but gmond
 (on the same box) is reporting them with all kinds of random values.

 Running gmond --debug=10, I see various error messages in the log:

 Some of these:
 sequence number error - 10.235.240.31:443-3:443 lostSamples=37

 Some of these:
 ERROR: [Errno 111] Connection refused

 And some with the hostname NULL:  (But only one time for each metric)
 ***Allocating value packet for host--(null)-- and metric
 --http_meth_put--
 


 Has anyone heard of this issue? I've started adding debug statements
 to gmond, but before I go through all of that, I'd like to know if it's
 a known issue.

 Thanks for any info,
 jon



 Ganglia-general mailing list
 Ganglia-general@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo/ganglia-general


Re: [Ganglia-general] NGINX / SFLOW / Ganaglia - metrics get corrupted

2014-03-07 Thread neil mckee
Mark,

It does seem like the issue is with the sFlow from nginx-sflow-module.  I
wrote that module so I can probably help:

(1) just one instance of nginx on that server,  or two?
(2) what version of nginx?
(3) single-threaded or multi-threaded nginx?
(4) running on Linux OS?
(5) please upgrade to the latest nginx-sflow-module (0.9.8),  the one you
are running (0.9.7)  has a bug that affects graceful restarts.  The fix was
a one-liner,  so it's not a big step.
(6) please capture and send a trace of the sFlow packets arriving from this
nginx source.  For example,  if the IP address is 10.1.2.3 and it's coming
in on eth0:

As root:

  /usr/sbin/tcpdump -i eth0 -s 0 -w nginx_sflow.pcap udp port 6343 and ip src 10.1.2.3
  (press Ctrl-C after a few minutes to stop)
  gzip nginx_sflow.pcap

then send nginx_sflow.pcap.gz

(7) please also send /etc/hsflowd.conf

The kind of thing it might be:
  - two nginx-sflow-module instances running on the same host and not being
disambiguated properly (this is supposed to happen automatically: each
process uses the lowest-numbered TCP port it is listening on as its sFlow
datasource index)
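
To illustrate why a shared datasource index could matter, here is a toy model. It assumes a receiver that tracks the last sequence number per (agent, datasource index) pair and flags any jump; that tracking scheme, the function names and the port numbers are assumptions for illustration, not code taken from gmond or nginx-sflow-module.

```python
def datasource_index(listening_ports):
    """Pick the lowest-numbered listening TCP port as the datasource index."""
    return min(listening_ports)


def check_sequence(stream):
    """Return (key, gap) for every sample whose sequence number is not last + 1.

    stream: iterable of (agent, ds_index, seq_no) tuples in arrival order.
    """
    last_seen = {}
    errors = []
    for agent, ds_index, seq_no in stream:
        key = (agent, ds_index)
        if key in last_seen and seq_no != last_seen[key] + 1:
            errors.append((key, seq_no - last_seen[key] - 1))
        last_seen[key] = seq_no
    return errors


agent = "10.1.2.3"

# Two hypothetical instances whose lowest listening port is the same.
index_a = datasource_index([443, 8443])
index_b = datasource_index([443, 9443])

# Each instance counts its own sequence numbers; the receiver sees them mixed.
shared = [(agent, index_a, 1), (agent, index_b, 101),
          (agent, index_a, 2), (agent, index_b, 102)]
print(check_sequence(shared))

# With distinct indexes the same samples raise no errors.
distinct = [(agent, 443, 1), (agent, 8443, 101),
            (agent, 443, 2), (agent, 8443, 102)]
print(check_sequence(distinct))
```

In this model the shared index makes every interleaved sample look like a sequence jump, and the counters from the two instances would be differenced against each other, while distinct indexes keep the two streams separate.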

Regards,
Neil



