RE: [Ntop] RRD graphs totally screwed up

Burton M. Strauss III Fri, 25 Jul 2003 09:58:25 -0700

Lots to answer... let's start small.  I think I can teach you enough to
answer your own questions so you can do a decent write-up for us!


I thought I had done a long write up on ring-buffers once upon a time, but I
can't find it...  If you're a Computer Science type, as you read the rrd
tool docs, translate RRA -> ring buffer.  It makes a lot more sense.

Anyway - there are three main (AVERAGE) RRAs in each of the ntop rrd
databases, plus a MIN and a MAX.

In the log is the rrd create line - (see, that's why I log it, y'all):

RRD: rrdtool create --start now-1 file --step 300
   DS:counter:COUNTER:300:0:12500000
   RRA:AVERAGE:0.5:1:864
   RRA:MIN:0.5:1:72
   RRA:MAX:0.5:1:72
   RRA:AVERAGE:0.5:12:2160
   RRA:AVERAGE:0.5:288:1080

(While that MIN and MAX look wrong, I think it's because they're 2nd order
values - i.e. the minimum of the 72 1h entries.  I know I tried to change it
and it screwed everything up.)

Anyway, let's ignore them for now and focus on the AVERAGE RRAs.  man
rrdcreate - read it... 'splains a lot of this.  In fact, let's quote it
(RTFM) instead of ME writing more words.

"       DS:ds-name:DST:heartbeat:min:max
               A single RRD can accept input from several data sources (DS).
               (e.g. Incoming and Outgoing traffic on a specific communication
               line). With the DS configuration option you must define some
               basic properties of each data source you want to use to feed
               the RRD.

               ds-name is the name you will use to reference this particular
               data source from an RRD. A ds-name must be 1 to 19 characters
               long in the characters [a-zA-Z0-9_].

               DST defines the Data Source Type. See the section on "How to
               Measure" below for further insight.  The Datasource Type must
               be one of the following:
...
               COUNTER
                   is for continuous incrementing counters like the InOctets
                   counter in a router. The COUNTER data source assumes that
                   the counter never decreases, except when a counter
                   overflows.  The update function takes the overflow into
                   account.  The counter is stored as a per-second rate.
                   When the counter overflows, RRDtool checks if the overflow
                   happened at the 32bit or 64bit border and acts accordingly
                   by adding an appropriate value to the result.
...
               heartbeat defines the maximum number of seconds that may pass
               between two updates of this data source before the value of
               the data source is assumed to be *UNKNOWN*.

               min and max are optional entries defining the expected range
               of the data supplied by this data source. If min and/or max
               are defined, any value outside the defined range will be
               regarded as *UNKNOWN*. If you do not know or care about min
               and max, set them to U for unknown. Note that min and max
               always refer to the processed values of the DS. For a
               traffic-COUNTER type DS this would be the max and min
               data-rate expected from the device.

               If information on minimal/maximal expected values is
               available, always set the min and/or max properties. This will
               help RRDtool in doing a simple sanity check on the data
               supplied when running update."

OK?

So "DS:counter:COUNTER:300:0:12500000" means we're defining a 'data source',
named 'counter', which can go no more than 300 seconds between data points
(otherwise they're 'unknown') and can have per-second values from
0..12,500,000.

Now you can DO updates (man rrdupdate) at 1s intervals, but rrd will just
combine them into that 300s interval...

"       The update function feeds new data values into an RRD. The data gets
       time aligned according to the properties of the RRD to which the data
       is written."

Now, I *think* this means that if you do this:

rrdtool update ipbytes.rrd 887457267:10
rrdtool update ipbytes.rrd 887457268:10
rrdtool update ipbytes.rrd 887457269:10

it's going to update whatever row 88745726x falls into with 10+10+10 / 300s
or 0.1/second...  but I'm not 100% sure.  Anyway, ntop isn't SUPPOSED to
make more than one update per interval.  So it's not SUPPOSED to matter.
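
One way to test that guess against the COUNTER text quoted above: a rough
Python sketch, assuming that what gets stored is the increase between two
readings divided by the seconds between them, not the raw reading. The
function name counter_rate and the simplified wrap handling are illustrative
assumptions, not rrdtool code, and the time alignment to the 300s step is
left out entirely.

```python
def counter_rate(prev_time, prev_value, time, value,
                 heartbeat=300, min_rate=0.0, max_rate=12500000.0):
    """Per-second rate between two COUNTER readings, or None for 'unknown'."""
    elapsed = time - prev_time
    if elapsed <= 0 or elapsed > heartbeat:
        return None                  # too long between updates: unknown
    delta = value - prev_value
    if delta < 0:
        delta += 2 ** 32             # assume the counter wrapped at 32 bits
    if delta < 0:
        delta += 2 ** 64 - 2 ** 32   # otherwise assume a 64-bit wrap
    rate = delta / elapsed
    if rate < min_rate or rate > max_rate:
        return None                  # outside the DS min..max range: unknown
    return rate

# Three identical readings one second apart: the counter did not move.
print(counter_rate(887457267, 10, 887457268, 10))      # 0.0
# A counter that grew by 3000 over 300 seconds.
print(counter_rate(887457000, 5000, 887457300, 8000))  # 10.0
```

Under that reading, the three updates of 10 above would store a rate of 0
rather than 0.1/second, since the counter never moved between readings.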

So let's move on to the RRA lines - again from the man page, we're defining
three RRAs (ring buffers), according to this:

"       RRA:CF:xff:steps:rows
               The purpose of an RRD is to store data in the round robin
               archives (RRA). An archive consists of a number of data
               values from all the defined data-sources (DS) and is defined
               with an RRA line.

               When data is entered into an RRD, it is first fit into time
               slots of the length defined with the -s option becoming a
               primary data point.

               The data is also consolidated with the consolidation function
               (CF) of the archive. The following consolidation functions
               are defined: AVERAGE, MIN, MAX, LAST.

               xff The xfiles factor defines what part of a consolidation
               interval may be made up from *UNKNOWN* data while the
               consolidated value is still regarded as known.

               steps defines how many of these primary data points are used
               to build a consolidated data point which then goes into the
               archive.

               rows defines how many generations of data values are kept in
               an RRA."

So, RRA:AVERAGE:0.5:1:864 means:

we're going to store 864 rows of data. (The ring concept means that the
865th value overlays the 1st.  You always have the most recent 864, never
more or less - although when you create the RRA at time t, the 863 values
for times less than t are 'unknown').

Each row is a consolidation of 1 primary point.
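
The ring idea can be sketched in a few lines of Python. This is only an
illustration using a fixed-length deque, with None standing in for
'unknown'; it says nothing about how rrdtool lays the rows out on disk.

```python
from collections import deque

ROWS = 864
rra = deque([None] * ROWS, maxlen=ROWS)  # None stands for 'unknown'

for value in range(1, 866):   # store 865 values: 1, 2, ..., 865
    rra.append(value)         # appending to a full deque drops the oldest row

print(len(rra))   # 864
print(rra[0])     # 2   (value 1 has been overlaid)
print(rra[-1])    # 865
```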

So assume the slots we record are:  13:05 - 300 packets
                                    13:10 - 300 packets
                                    13:15 - 600 packets
                                    13:20 - missing (no packets)
                                    13:25 - 450 packets
etc.

Our data 'rows' are now (remember or learn that rrd converts the absolute
numbers into a per second value):

1.0
1.0
2.0
-
1.5

and so on, for the full 864 rows at 5m intervals (72 hours).
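
The arithmetic behind those rows, as a small Python sketch. It assumes each
slot holds the count seen during that 300s step and uses None for the
missing slot.

```python
STEP = 300  # seconds per slot, from --step 300

slots = [("13:05", 300), ("13:10", 300), ("13:15", 600),
         ("13:20", None), ("13:25", 450)]   # None = no update arrived

# Divide each count by the step length; a missing slot stays unknown.
rates = [None if count is None else count / STEP for _, count in slots]
print(rates)   # [1.0, 1.0, 2.0, None, 1.5]
```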

The second RRA, "RRA:AVERAGE:0.5:12:2160"

is a roll up of 12 primary points (e.g. 60 minutes or 1 hour), and there are
2160 (90 days) worth.  We average the primary points and no more than 50%
(0.5) can be missing...

The third RRA is left as an exercise for the reader.
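
To check your answer, here is a small Python sketch that works out how long
one row spans and how much history each RRA holds. The helper name coverage
is made up for illustration; the numbers come straight from the create line
above.

```python
STEP = 300  # seconds, from --step 300

# (steps, rows) for the three AVERAGE RRAs in the create line
rras = [(1, 864), (12, 2160), (288, 1080)]

def coverage(steps, rows, step=STEP):
    """Return (seconds per row, total hours covered) for one RRA."""
    row_seconds = step * steps
    return row_seconds, row_seconds * rows / 3600

for steps, rows in rras:
    row_seconds, hours = coverage(steps, rows)
    print(f"{steps:>3} steps x {rows:>4} rows: one row per {row_seconds}s, "
          f"{hours:g} hours ({hours / 24:g} days)")
```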


But you can see how data can be 'lost' between RRAs, right?

Say our 'hour' is this:

01. -
02. -
03. -
04. -
05. -
06. -
07. -
08. 10.0  (10.0 per second for 300s = 3000 packets)
09. 10.0
10. 10.0
11. 10.0
12. 10.0

So our 5m interval graph shows 0 0 0 0 0 0 0 10 10 10 10 10, representing
15K packets.

However, more than 50% is missing.

So the 1h interval shows nothing (the row is 'unknown') - and 15K packets
are 'lost'.
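
The hour above can be run through a small Python sketch of the xff check.
It assumes that the known points are averaged on their own and that the row
becomes unknown (None here) once the unknown share of the interval exceeds
xff; the function name consolidate_average is illustrative, not rrdtool's.

```python
def consolidate_average(points, xff=0.5):
    """Average one consolidation interval; None stands for 'unknown'."""
    unknown = sum(1 for p in points if p is None)
    if unknown / len(points) > xff:
        return None                      # too much of the interval is unknown
    known = [p for p in points if p is not None]
    return sum(known) / len(known) if known else None

hour = [None] * 7 + [10.0] * 5           # the hour shown above
print(consolidate_average(hour))         # None: 7 of 12 unknown is over 0.5

half = [None] * 6 + [10.0] * 6
print(consolidate_average(half))         # 10.0: exactly half unknown passes
```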


There are two ways to 'fix' this, and both have costs.

Fix one is to record 0s.  This means that EVERY rrd will have to be updated
for EVERY pass - which is a huge increase in work effort for ntop.

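Fix two is to raise the 0.5 toward 1 (say 0.9), since xff is the share of
the interval that is ALLOWED to be unknown; dropping it to 0.1 or 0.0 would
make things stricter, not looser.  The issue here is that it will obscure
truly missing data.  If ntop was down for the first 35m of that hour or the
data really is missing, rrd averages just the five known points and shows
10.0 as the 'average' rate for the hour, when the packets actually seen only
come to 4.2 (15000/3600).  This really isn't true.  But it may be a
compromise you can live with...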


Although it's counter-intuitive (and undocumented), rrdgraph does not use
data from multiple RRAs.  From experimentation, it seems to pick the RRA
that has the 'best' coverage.  So say you're doing a 6h graph and have 3h of
data in the 5m RRA and 4h of data in the 1h RRA.  It will use the 4 hourly
points.


Still with me?  You should now be able to answer or reformulate all of your
questions, except for # 5.  For that, look at rrdPlugin.c:

    argv[argc++] = "GPRINT:ctr:MIN:Min\\: %3.1lf%s";

and at man rrdgraph

"           If an additional '%s' is found AFTER the marker, the value will
           be scaled and an appropriate SI magnitude unit will be printed in
           place of the '%s' marker. The scaling will take the '--base'
           argument into consideration!"

OK?

So 59.4 means 59.4 per second
and 35.6k means 35600 per second
and the "m" is milli, so 500m means 0.5 per second
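
A rough Python imitation of what that '%3.1lf%s' format does to a value,
including the small prefixes such as "m" (milli). The function si_format and
its short prefix table are assumptions for illustration; rrdgraph's own
scaling code handles more magnitudes and the '--base' option differently.

```python
def si_format(value, base=1000):
    """Scale a value into the 1..base range and append an SI prefix."""
    prefixes = {-2: "u", -1: "m", 0: "", 1: "k", 2: "M", 3: "G"}
    exponent = 0
    scaled = float(value)
    while abs(scaled) >= base and exponent < 3:
        scaled /= base               # large values: step up a prefix
        exponent += 1
    while 0 < abs(scaled) < 1 and exponent > -2:
        scaled *= base               # small values: step down a prefix
        exponent -= 1
    return f"{scaled:3.1f}{prefixes[exponent]}"

print(si_format(59.4))     # 59.4
print(si_format(35600))    # 35.6k
print(si_format(0.5))      # 500.0m
```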


WRT "LITTLE PROBLEM", are you sure that your line isn't actually bursting
above 256K?  What are the CIR and MIR values?


-----Burton

US-based commercial support for ntop:
     http://www.ntopsupport.com
     mailto:[EMAIL PROTECTED]

Search the ntop mailing lists at gmane:
     http://search.gmane.org

HowTo Ask for Help at
     http://snapshot.ntop.org/faq.php#83



-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of
Filippo Stefanelli
Sent: Friday, July 25, 2003 10:59 AM
To: [EMAIL PROTECTED]
Subject: [Ntop] RRD graphs totally screwed up


Hi all,

I hope to have a good thread on this subject, which is literally driving me
crazy.
I think that a good thread will help many people out there.

All pages and images I will talk about are in the attached file.

There are two different problems (different in terms of magnitude), so
let's go.

MAIN PROBLEM

I set up ntop with this filter to monitor only traffic that goes out of the
local net via our router and to see only ip traffic:
ip and (ether host gw-pd-eth and not host gw-pd)
where gw-pd-eth and gw-pd are the MAC address and IP address of the
gateway.

After that, I monitored our network during a normal working period of some
days and saved the graphs that are generated by the rrdplugin script.

Then I stressed the link with very large ftp/http/misc traffic requests,
and now I want to see what changes in the graphs displayed by the
rrdplugin script.

And here comes the problem. The graphs are totally insane, as you can
see.
I can't understand why:
1) In the 1h graph some graphs have data only after a certain time and
others only before
1bis) In the other graphs there are holes in the data during some
periods of time
2) Some protocols have much larger bars than others
3) Some protocols (often the last ones in the list) have only isolated
spikes even though they are stress test protocols and so are always
present (I mean for example ftp)
4) In some links (e.g. 1year, 1month) the plot boxes of some protocols
appear but are empty. I think it is because the time window is too large,
but I have seen the same behaviour in protocol plots with smaller
time windows.
5) What is the unit of measurement "m"?
6) How are the max, min, average and current values calculated?
I ask this sixth question because when I launched an http file transfer
from a pc inside the monitored lan, the data rate indicated by wget was
30K/s, but during the file transfer ntop displayed the http protocol
at 8K/s average and current, with spikes of 10K/s ?!?
 A measurement totally inconsistent with that of wget.

The only thing that I have understood is the sum value and what it
means that it changes with the time period displayed.

I am totally disoriented.
I PROMISE THAT I WILL WRITE UP ALL THE THINGS THAT I LEARN
FROM THIS SITUATION AND I WILL GIVE THESE DOCS TO THE NTOP
COMMUNITY. This will be an appendix of my thesis.

 LITTLE PROBLEM

The lan that I am testing has a link to the outside world of 256K/s.
Why, when the line is stressed, does the network load graph
go above 250K/s, as you can see from the attached image?!?

I have tried to think in terms of average and not absolute values, but I
have not understood it this way either.

PLEASE HELP ME!!!!!
I thought that ntop was a good companion for doing network analysis in
terms of network usage split by protocol and over time, but
now I am facing a very big problem and I can't figure out how to
read the rrd graphs!

I am in your hands, community; please read and give me some clues!

Sorry for my horrible english.

Filippo.

PS: in the pics you will see, among others, the names of protocols like
LotusNotes or MiniSoft. These are protocols of interest in this corp.,
nothing strange.

THANKS AGAIN!!!

_______________________________________________
Ntop mailing list
[EMAIL PROTECTED]
http://listgateway.unipi.it/mailman/listinfo/ntop
