Mark,
It does seem like the issue is with the sFlow from nginx-sflow-module. I
wrote that module so I can probably help:
(1) just one instance of nginx on that server, or two?
(2) what version of nginx?
(3) single-threaded or multi-threaded nginx?
(4) running on Linux OS?
(5) please upgrade to the latest nginx-sflow-module (0.9.8). The one you
are running (0.9.7) has a bug that affects graceful restarts. The fix was
a one-liner, so it's not a big step.
(6) please capture and send a trace of the sFlow packets arriving from this
nginx source. For example, if the IP address is 10.1.2.3 and it's coming
in on eth0:
As root:
    /usr/sbin/tcpdump -i eth0 -s 0 -w nginx_sflow.pcap udp port 6343 and ip src 10.1.2.3
    (control-c after a few minutes to stop)
    gzip nginx_sflow.pcap
Then send nginx_sflow.pcap.gz.
(7) please also send /etc/hsflowd.conf
The kind of thing it might be:
- two nginx-sflow-modules running on the same host and not disambiguating
properly (supposed to happen automatically by choosing sflow datasource
index as lowest numbered TCP port number that process is listening on)
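The selection rule in the bullet above can be sketched in Python as follows. This is only an illustration of the rule as stated in this message, not the module's actual code; the function name `datasource_index` and the port lists are hypothetical.

```python
def datasource_index(listening_ports):
    """Return the lowest-numbered TCP port, used here as the datasource index.

    Mirrors the rule described above; it is not taken from the
    nginx-sflow-module source.
    """
    if not listening_ports:
        raise ValueError("process is not listening on any TCP port")
    return min(listening_ports)


# Hypothetical example: two nginx instances on one host.
instance_a = datasource_index([443, 80])     # lowest port is 80
instance_b = datasource_index([8443, 8080])  # lowest port is 8080

# Distinct indexes mean the two streams can be told apart. If both
# instances resolved to the same index, their counters would be mixed.
distinct = instance_a != instance_b
```

If both instances ended up with the same index (for example, each bound to port 443 on a different address), a receiver would presumably see them as one datasource.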
Regards,
Neil
On Fri, Mar 7, 2014 at 3:40 PM, Bernard Li bern...@vanhpc.org wrote:
Can you connect to the gmond port and paste the XML for the metrics in
question? I'd like to see how they're defined.
Thanks,
Bernard
On Fri, Mar 7, 2014 at 11:08 AM, Flanagan, Mark mark.flana...@unify.com
wrote:
http://www.sflow.org/ appears to be the defining entity for sflow.
http://www.sflow.org/sflow_http.txt would appear to define the http
sflow data.
It is not explicitly clear just what the counter values are supposed
to mean. The general architecture of sflow-like data would suggest the
values should be a running counter (like the network interface metrics)
which means gmond is implementing the packets properly and NGINX is sending
the wrong data.
That's just my guess for now.
-----Original Message-----
From: Bernard Li [mailto:bern...@vanhpc.org]
Sent: Friday, March 07, 2014 1:39 PM
To: Silver, Jonathan
Cc: ganglia-general@lists.sourceforge.net; Flanagan, Mark
Subject: Re: [Ganglia-general] NGINX / SFLOW / Ganaglia - metrics get
corrupted
Hi Jonathan:
Perhaps you can share how these metrics are defined?
Cheers,
Bernard
On Fri, Mar 7, 2014 at 10:21 AM, Silver, Jonathan
jonathan.sil...@unify.com wrote:
Does the following analysis mean anything to anyone?
It seems to me that this is a basic thing that should have been seen by
everyone else and found during first test - unless it's some config
parameter.
Thanks
Jon
---
Well, I think I understand what is happening - but I don't even want to
think about fixing it. I'm not sure which software is right.
The sflow data coming from NGINX reports the number of various HTTP
messages (GET, HEAD, 1XX, 2XX, etc) in the measured period.
The period is either 10 or 20 seconds - I don't have any idea why that
isn't consistent.
When gmond receives the HTTP data in sflow format, it computes the
difference between the most recently reported value and the one before and
divides that by the reported interval. That is, it is expecting a running
total and that is NOT what is received.
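To make the suspected mismatch concrete, here is a minimal Python sketch. It is not gmond's code; the names `rate_from_counter` and `rates` and all sample numbers are made up for illustration. It assumes the receiver treats each report as a running total and divides the difference by the interval.

```python
def rate_from_counter(prev_value, curr_value, interval_secs):
    """Rate computed by a receiver that assumes a running (cumulative) counter."""
    return (curr_value - prev_value) / interval_secs


def rates(samples, interval_secs=10):
    """Apply the counter-to-rate conversion to each consecutive pair of samples."""
    return [
        rate_from_counter(prev, curr, interval_secs)
        for prev, curr in zip(samples, samples[1:])
    ]


# Hypothetical GET counts per 10-second period: 50, 30, 70, 40.
per_period = [50, 30, 70, 40]        # values if the sender reports per-period counts
running_totals = [50, 80, 150, 190]  # the same traffic as a cumulative counter

expected = rates(running_totals)  # sensible rates from a running total
observed = rates(per_period)      # differences of per-period counts look random
```

With these made-up numbers the running totals give 3.0, 7.0 and 4.0 requests per second, while the per-period values give -2.0, 4.0 and -3.0, which would look like random values on a graph.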
I don't know which software is right, but the NGINX reports are not
what the gmond handler expects.
All the other sflow reports appear to be correct.
-- Mark
Flow plug-in: I am still trying to find out. It is actually built by
another group and I'm not sure what they pulled, but I'm pretty sure
it's 0.9.8
hsflowd version 1.23.2
gmond 3.6.0
-
On Tuesday, 4 March 2014, Silver, Jonathan jonathan.sil...@unify.com
wrote:
We're using NGINX and sflow, to capture and send the metrics to
ganglia.
The metric values look correct when viewed using sflowtool, but gmond
(on the same box) is reporting them with all kinds of random values.
Running gmond --debug=10, I do see various error messages in the
log:
Some of these:
sequence number error - 10.235.240.31:443-3:443 lostSamples=37
Some of these:
ERROR: [Errno 111] Connection refused
And some with the hostname NULL: (But only one time for each metric)
***Allocating value packet for host--(null)-- and metric
--http_meth_put--
Has anyone heard of this issue? I've started adding debug statements
to gmond, but before I go through all of that, I'd like to know if it's
a known issue.
Thanks for any info,
jon