Lots to answer... let's start small. I think I can teach you enough to
answer your own questions so you can do a decent write-up for us!
I thought I had done a long write up on ring-buffers once upon a time, but I
can't find it... If you're a Computer Science type, as you read the rrdtool
docs, translate RRA -> ring buffer. It makes a lot more sense.
Anyway - there are three AVERAGE RRAs in each of the ntop rrd databases
(plus a MIN and a MAX).
In the log is the rrd create line - (see, that's why I log it, y'all):
RRD: rrdtool create --start now-1 file --step 300
DS:counter:COUNTER:300:0:12500000
RRA:AVERAGE:0.5:1:864
RRA:MIN:0.5:1:72
RRA:MAX:0.5:1:72
RRA:AVERAGE:0.5:12:2160
RRA:AVERAGE:0.5:288:1080
(While those MIN and MAX lines look wrong, I think it's because they're 2nd
order values - i.e. the minimum of the 72 1h entries. I know I tried to change
it and it screwed everything up.)
Anyway, let's ignore them for now and focus on the DS line and the AVERAGE
RRAs. man rrdcreate - read it... 'splains a lot of this. In fact, let's quote
it (RTFM) instead of ME writing more words.
" DS:ds-name:DST:heartbeat:min:max
A single RRD can accept input from several data sources (DS).
(e.g. Incoming and Outgoing traffic on a specific communication
line). With the DS configuration option you must define some
basic properties of each data source you want to use to feed
the RRD.
ds-name is the name you will use to reference this particular
data source from an RRD. A ds-name must be 1 to 19 characters
long in the characters [a-zA-Z0-9_].
DST defines the Data Source Type. See the section on "How to
Measure" below for further insight. The Datasource Type must
be one of the following:
...
COUNTER
is for continuous incrementing counters like the InOctets
counter in a router. The COUNTER data source assumes that
the counter never decreases, except when a counter overflows.
The update function takes the overflow into account. The
counter is stored as a per-second rate. When the counter
overflows, RRDtool checks if the overflow happened at the
32bit or 64bit border and acts accordingly by adding an
appropriate value to the result.
...
heartbeat defines the maximum number of seconds that may pass
between two updates of this data source before the value of
the data source is assumed to be *UNKNOWN*.
min and max are optional entries defining the expected range
of the data supplied by this data source. If min and/or max are
defined, any value outside the defined range will be regarded
as *UNKNOWN*. If you do not know or care about min and max,
set them to U for unknown. Note that min and max always refer to
the processed values of the DS. For a traffic-COUNTER type DS
this would be the max and min data-rate expected from the
device.
If information on minimal/maximal expected values is available,
always set the min and/or max properties. This will help
RRDtool in doing a simple sanity check on the data supplied when
running update."
OK?
So "DS:counter:COUNTER:300:0:12500000" means we're defining a 'data source',
named 'counter', which can go no more than 300 seconds between data points
(otherwise they're 'unknown') and can have values from 0..12,500,000. Per the
man page, that range applies to the processed per-second rate, not to the raw
counter.
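
The COUNTER behaviour quoted from the man page (per-second rate, overflow
correction, min/max check on the processed value) can be sketched roughly like
this. It is a simplified model in Python, not rrdtool's actual code; the
function name and the use of None for *UNKNOWN* are made up for illustration.

```python
# Simplified model of a COUNTER data source; NOT rrdtool's actual code.
WRAP_32 = 2**32
WRAP_64 = 2**64

def counter_rate(prev, cur, seconds, lo=0.0, hi=12_500_000.0):
    """Per-second rate between two counter readings, or None for *UNKNOWN*.

    lo/hi mirror the 0:12500000 min/max from the DS line.
    """
    delta = cur - prev
    if delta < 0:                  # counter went "backwards": assume a 32-bit wrap
        delta += WRAP_32
    if delta < 0:                  # still negative: assume a 64-bit wrap instead
        delta += WRAP_64 - WRAP_32
    rate = delta / seconds
    if rate < lo or rate > hi:     # min/max apply to the processed rate
        return None
    return rate

print(counter_rate(1000, 1300, 300))           # 300 more in 300s -> 1.0
print(counter_rate(WRAP_32 - 100, 200, 300))   # wrapped at the 32-bit border -> 1.0
print(counter_rate(0, 6_000_000_000, 300))     # rate above max -> None
```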
Now you can DO updates (man rrdupdate) at 1s intervals, but rrd will just
combine them into that 300s interval...
" The update function feeds new data values into an RRD. The data gets
time aligned according to the properties of the RRD to which the data
is written."
Now, I *think* this means that if you do this:
rrdtool update ipbytes.rrd 887457267:10
rrdtool update ipbytes.rrd 887457268:10
rrdtool update ipbytes.rrd 887457269:10
all three land in whatever 300s row 88745726x falls into. Since the DS is a
COUNTER, what gets averaged into that row is the change between readings per
second, not 10+10+10 (and here the counter never moved, so that's a rate of
0)... but I'm not 100% sure. Anyway, ntop isn't SUPPOSED to
make more than one update per interval. So it's not SUPPOSED to matter.
So let's move on to the RRA lines - again from the man page, we're defining
three RRAs (ring buffers), according to this:
" RRA:CF:xff:steps:rows
The purpose of an RRD is to store data in the round robin
archives (RRA). An archive consists of a number of data values
from all the defined data-sources (DS) and is defined with an
RRA line.
When data is entered into an RRD, it is first fit into time
slots of the length defined with the -s option becoming a
primary data point.
The data is also consolidated with the consolidation function
(CF) of the archive. The following consolidation functions are
defined: AVERAGE, MIN, MAX, LAST.
xff The xfiles factor defines what part of a consolidation
interval may be made up from *UNKNOWN* data while the
consolidated value is still regarded as known.
steps defines how many of these primary data points are used to
build a consolidated data point which then goes into the
archive.
rows defines how many generations of data values are kept in an
RRA."
So, RRA:AVERAGE:0.5:1:864 means:
we're going to store 864 rows of data. (The ring concept means that the
865th value overlays the 1st. You always have the most recent 864, never
more or less - although when you create the RRA at time t, the 863 values
for times less than t are 'unknown').
Each row is a consolidation of 1 primary point.
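
If the ring idea is new to you, here is a minimal sketch in Python. The class
and method names are invented for illustration; this is not how rrdtool lays
out its file, just the overwrite-the-oldest behaviour described above.

```python
class RingBuffer:
    """Fixed number of rows; each new value overwrites the oldest one."""

    def __init__(self, rows):
        self.rows = [None] * rows   # None stands in for 'unknown'
        self.next = 0               # index of the slot to overwrite next

    def push(self, value):
        self.rows[self.next] = value
        self.next = (self.next + 1) % len(self.rows)

    def oldest_first(self):
        # The slot about to be overwritten holds the oldest value.
        return self.rows[self.next:] + self.rows[:self.next]

rra = RingBuffer(864)
for i in range(1, 866):             # push 865 values: 1..865
    rra.push(i)

ordered = rra.oldest_first()
print(len(ordered), ordered[0], ordered[-1])   # 864 2 865: value 1 was overlaid
```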
So assume the slots we record are:
  13:05 - 300 packets
  13:10 - 300 packets
  13:15 - 600 packets
  13:20 - missing (no packets)
  13:25 - 450 packets
  etc.
Our data 'rows' are now (remember or learn that rrd converts the absolute
numbers into a per second value):
1.0
1.0
2.0
-
1.5
and so on, for the full 864 rows at 5m intervals (72 hours).
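
The count-to-rate arithmetic for those slots, as a small Python sketch. The
names are made up, and None stands in for the missing 13:20 slot.

```python
STEP = 300  # seconds per slot, from --step 300

def to_rate(count, step=STEP):
    """Packets seen in one slot -> packets per second (None stays unknown)."""
    return None if count is None else count / step

slots = [("13:05", 300), ("13:10", 300), ("13:15", 600),
         ("13:20", None), ("13:25", 450)]
rows = [to_rate(count) for _, count in slots]
print(rows)   # [1.0, 1.0, 2.0, None, 1.5]
```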
The second RRA, "RRA:AVERAGE:0.5:12:2160"
is a roll up of 12 primary points (e.g. 60 minutes or 1 hour), and there are
2160 (90 days) worth. We average the primary points and no more than 50%
(0.5) can be missing...
The third RRA is left as an exercise for the reader.
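
To check the arithmetic (and your answer to the exercise), the time covered by
an RRA is just step x steps x rows. A small Python sketch, with a made-up
helper name and the 300s step from the create line assumed as the default:

```python
from datetime import timedelta

def rra_coverage(spec, step=300):
    """'RRA:CF:xff:steps:rows' -> (seconds per row, total time covered)."""
    _, _cf, _xff, steps, rows = spec.split(":")
    row_seconds = step * int(steps)
    return row_seconds, timedelta(seconds=row_seconds * int(rows))

for spec in ("RRA:AVERAGE:0.5:1:864",
             "RRA:AVERAGE:0.5:12:2160",
             "RRA:AVERAGE:0.5:288:1080"):
    row_seconds, total = rra_coverage(spec)
    print(spec, "->", row_seconds, "s per row,", total.days, "days")
```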
But you can see how data can be 'lost' between RRAs, right?
Say our 'hour' is this:
01. -
02. -
03. -
04. -
05. -
06. -
07. -
08. 10.0 (10.0 per second for 300s = 3000 packets)
09. 10.0
10. 10.0
11. 10.0
12. 10.0
So our 5m interval graph shows 0 0 0 0 0 0 0 10 10 10 10 10, representing
15K packets (5 slots x 3000).
However, more than 50% is missing (7 of the 12 slots).
So the 1h row is 'unknown', the 1h interval shows 0 - and 15K packets are 'lost'.
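
The known/unknown decision for that hour can be sketched like this in Python.
It is a simplified model of the xff rule as the man page words it (the part of
the interval that may be unknown), with invented names; it is not rrdtool's
source.

```python
def consolidated_is_known(points, xff):
    """True if few enough primary points are unknown (None) for the
    consolidated row to still count as known."""
    unknown = sum(1 for p in points if p is None)
    return unknown <= xff * len(points)

hour = [None] * 7 + [10.0] * 5          # the twelve 5m rows from above

# Packets represented by the rows that do have data: rate x 300s each.
packets = sum(rate * 300 for rate in hour if rate is not None)
print(packets)                           # 15000.0
print(consolidated_is_known(hour, 0.5))  # False: 7 of 12 unknown is over 50%
```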
There are two ways to 'fix' this, and both have costs.
Fix one is to record 0s. This means that EVERY rrd will have to be updated
for EVERY pass - which is a huge increase in work effort for ntop.
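Fix two is to raise the 0.5 (to 0.9, say). Per the man page text above, xff
is the part of the interval that MAY be unknown, so a higher value tolerates
more missing slots; 0.1 or 0.0 would make it stricter. The issue here is that
it will obscure truly missing data. If ntop was down for the first 35m of that
hour or the data really is missing, you're going to show an 'average' rate for
the hour that is built from only 25m of data. (If the missing slots were really
0s, the true hourly rate would be about 4.2, i.e. 15000/3600.) This really
isn't true. But it may be a compromise you can live with...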
Although it's counter-intuitive (and undocumented), rrdgraph does not use
data from multiple RRAs. From experimentation, it seems to pick the RRA
that has the 'best' coverage. So say you're doing a 6h graph and have 3h of
data in the 5m RRA and 4h of data in the 1h RRA. It will use the 4 hourly
points.
Still with me? You should now be able to answer or reformulate all of your
questions, except for # 5. For that, look at rrdPlugin.c:
argv[argc++] = "GPRINT:ctr:MIN:Min\\: %3.1lf%s";
and at man rrdgraph
" If an additional '%s' is found AFTER the marker, the value will
be scaled and an appropriate SI magnitude unit will be printed in
place of the '%s' marker. The scaling will take the '--base'
argument into consideration!"
OK?
So 59.4 means 59.4 per second
and 35.6k means 35600 per second
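
A rough Python sketch of that kind of SI scaling, for intuition only. The
prefix table, function name and loop are my own illustration, not rrdgraph's
code; on that reading the "m" from question 5 is the SI milli prefix, i.e.
thousandths per second.

```python
PREFIXES = ["u", "m", "", "k", "M", "G"]   # micro, milli, (none), kilo, mega, giga

def format_si(value, base=1000.0):
    """Scale value by powers of base and append an SI prefix, like '%3.1lf%s'."""
    scaled = float(value)
    if scaled == 0.0:
        return "0.0"
    idx = PREFIXES.index("")
    while abs(scaled) >= base and idx < len(PREFIXES) - 1:
        scaled /= base
        idx += 1
    while abs(scaled) < 1.0 and idx > 0:
        scaled *= base
        idx -= 1
    return "%3.1f%s" % (scaled, PREFIXES[idx])

print(format_si(59.4))     # 59.4
print(format_si(35600))    # 35.6k
print(format_si(0.0594))   # 59.4m
```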
WRT "LITTLE PROBLEM", are you sure that your line isn't actually bursting
above that? What are the CIR and MIR values?
-----Burton
-----Original Message-----
From: [EMAIL PROTECTED] On Behalf Of Filippo Stefanelli
Sent: Friday, July 25, 2003 10:59 AM
To: [EMAIL PROTECTED]
Subject: [Ntop] RRD graphs totally screwed up
Hi all,
I hope to have a good thread on this subject, which is literally driving me
crazy.
I think that a good thread will help many people out there.
All pages and images I will talk about are in the attached file.
There are two different problems (different in terms of magnitude), let's
go.
MAIN PROBLEM
I set up ntop with this filter to monitor only traffic that goes out of the
local net via our router and to see only ip traffic:
ip and (ether host gw-pd-eth and not host gw-pd)
where gw-pd-eth and gw-pd are the MAC address and IP address of the gateway.
After that, I monitored our network during a normal working period
of some days and saved the graphs that are generated
via the rrdplugin script.
Then I stressed the link with very heavy ftp/http/misc traffic requests,
and now I want to see what changes in the graphs displayed by the
rrdplugin script.
And here comes the problem. The graphs are totally insane, as you can
see.
I can't understand why:
1) In the 1h graph, some graphs have data only after a certain time and
others only before
1bis) In the other graphs there are holes in the data during some
periods of time
2) Some protocols have bars much larger than others
3) Some protocols (often the last ones in the list) have only isolated
spikes even though they are stress test protocols and so they are always
present (I mean ftp, for example)
4) In some links (e.g. 1year, 1month) some protocols' plot boxes appear
but are empty. I think it is because the time span is too large, but I have
seen the same behaviour in plots of protocols with smaller
time windows.
5) What is the unit of measurement "m"?
6) How are the max, min, average and current values calculated?
I ask this sixth question because when I launched an http file transfer from
a pc inside the monitored lan, the data rate indicated by wget was
30K/s, but during the file transfer ntop displayed the http protocol
at 8K/s average and current, with spikes of 10K/s ?!?
A measurement totally inconsistent with that of wget.
The only thing that I have understood is the sum value and what it
means that it changes with the time period displayed.
I am totally disoriented.
I PROMISE THAT I WILL WRITE ALL THE THINGS THAT I WILL LEARN
FROM THIS SITUATION AND I WILL GIVE THIS DOCS TO THE NTOP
COMMUNITY. This will be an appendix of my thesis.
LITTLE PROBLEM
The lan that I am testing has a link to the outside world of 256K/s.
Why, when the line is stressed, does the network load graph
go above 250K/s, as you can see from the attached image?!?
I have tried to think in terms of average and not absolute values, but I
still do not understand it that way either.
PLEASE HELP ME!!!!!
I thought that ntop was a good companion for doing network analysis in
terms of network usage split by protocol and over time, but
now I am facing a very big problem and I can't figure out how to
read the rrd graphs!
I am in your hands, community; please read and give me some clues!
Sorry for my horrible english.
Filippo.
PS: in the pics you will see, among others, the names of protocols like
LotusNotes or MiniSoft. These are protocols of interest in this corp.,
nothing strange.
THANKS AGAIN!!!