[rrd-users] Getting an overview of many statistics...

2015-05-29 Thread Peter Valdemar Mørch
Hi,

I'm looking for a little inspiration and experience here.

We have a customer that has about 400 interfaces and he'd like to get an
overview of how these interfaces are doing. When there are more than
about 15-20, looking at each individual graph simply breaks down.

My user wants an idea of what the normal situation is, and information
about the worst outliers / extreme cases.

Looking at average and standard deviation is a possibility, but most of my
users (and I) really have no good intuitive feeling for what standard
deviation really means. Plus, outlier/extreme information is lost.

I've seen that smokeping does something interesting, see e.g.

http://oss.oetiker.ch/smokeping-demo/img/Customers/OP/james~octopus_last_10800.png

The histogram approach, where darker grey implies more datapoints in that
region, could be cool. This gives the overview. I have no idea how this is
accomplished, though.

I was thinking of using a histogram approach like the above, overlaid with
the actual graphs of the N worst outliers/extremes. But that
implies lots of scripting and analysis (I'm guessing) to create the
histogram and to identify the outliers.

So: What have you guys done when creating an overview of many statistics?
I'll leave you with this picture from the gallery:

http://oss.oetiker.ch/rrdtool/gallery/576_nodes.png

This is exactly the situation I want to avoid.

Sincerely,

Peter

-- 
Peter Valdemar Mørch
http://www.morch.com


Re: [rrd-users] Getting an overview of many statistics...

2015-05-29 Thread Alex van den Bogaerdt
I would generate a number of squares, showing either green, amber or red. 
Clicking on a square of interest would bring up detailed information for 
that interface (together with an RRDtool graph).


The layout of the squares depends on what I'm looking at. It could be a 
geographic map, a network map, or even just a matrix of 32 columns by as 
many rows as needed.


All green is all good.

I would NOT include logic which compares one interface to the others. Bad is 
still bad, even if other interfaces are also bad. If all interfaces are 
doing equally badly, you would want to show all red, not all green.


Squares could show information from the last hour and the last hour only. Or, 
if so desired, a gradient from red to green, as a relative percentage of good 
vs. bad over the last 24 hours or so.
Squares could be divided into 2 or 4 triangles, showing independent 
variables.
There will be some point where adding more information to the overview 
results in less readability.


Defining 'normal', 'not so good' and 'really bad' is a challenge which needs 
to be determined together with the customer. After all, these are his 
interfaces and his expectations. Bandwidth utilisation near 100% and packet 
loss would probably be important factors to make decisions. Maybe each 
interface could have its own set of limits.
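
For example, a per-interface set of limits could be kept as a plain Perl
hash; the interface names and numbers below are invented, just to
illustrate the idea:

    # Hypothetical per-interface thresholds, to be tuned with the customer.
    my %limits = (
        if0001 => { util_warn => 0.80, util_crit => 0.95, loss_crit => 0.05 },
        if0002 => { util_warn => 0.70, util_crit => 0.90, loss_crit => 0.02 },
    );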


Input data for the script can come from 'rrdtool graph'. Do not use graphing 
elements; use PRINT instead of GPRINT, and you can get averages, maxima, et 
cetera to use in your decision tree. The more information you need to 
extract, the more computing power will be needed.
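
As a sketch, assuming an RRD called if0001.rrd with a data source named
'traffic' (both names made up), the Perl RRDs module can run such a graph
with no image and collect the PRINT lines:

    use RRDs;
    # Send the image to /dev/null; we only want the PRINT output.
    my ($prints, $xsize, $ysize) = RRDs::graph(
        '/dev/null',
        '--start', 'end-1h', '--end', 'now',
        'DEF:in=if0001.rrd:traffic:AVERAGE',
        'PRINT:in:AVERAGE:%.0lf',
        'PRINT:in:MAX:%.0lf',
    );
    die RRDs::error if RRDs::error;
    my ($avg, $max) = @$prints;   # feed these into the decision tree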


Can RRDtool do the rest of what I suggested? No. RRDtool is not a graphing 
program and although it is sometimes 'abused' as such, in many cases this 
involves unnecessary complexity. Creating, filling and reading a database 
every time just to display 24 columns (one for each hour) is IMHO a waste of 
resources. Just script it, or write a program in the language of your choice. 
Depending on how complex you want to make it, you could create the overview 
page using a script generating HTML and CSS only, or create a complex 
program which uses a graphics library and generates a clickable map.
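
A minimal sketch of such an overview generator in Perl (how %status gets
filled in - say from the PRINT output above - is left out, and all the names
are made up):

    use strict;
    use warnings;

    # Interface name => CSS colour; 'orange' stands in for amber here.
    my %status = (if0001 => 'green', if0002 => 'orange', if0003 => 'red');

    print "<html><body><div style='width:960px'>\n";
    for my $if (sort keys %status) {
        # One clickable square per interface; the fixed-width div wraps
        # them at roughly 32 squares per row.
        print "<a href='detail.cgi?if=$if' title='$if' "
            . "style='display:inline-block;width:28px;height:28px;"
            . "background:$status{$if}'></a>\n";
    }
    print "</div></body></html>\n";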


I'm sure others will have more suggestions, or can even point to existing 
software to use instead of reinventing the wheel.


HTH
cheers,
Alex




Re: [rrd-users] Getting an overview of many statistics...

2015-05-29 Thread Simon Hobson
Peter Valdemar Mørch pe...@morch.com wrote:

 Looking at average and standard deviation is a possibility, but most of my 
 users (and I) really have no good intuitive feeling for what standard 
 deviation really means.

+1, I don't either

 I've seen that smokeping does something interesting, see e.g.
 
 http://oss.oetiker.ch/smokeping-demo/img/Customers/OP/james~octopus_last_10800.png
 
 The histogram approach where darker grey implies more datapoints in this 
 region could be cool. This gives the overview. I have no idea how this is 
 accomplished, though.

I'm not sure dark = more in quite the way you are expecting. I suspect it's 
more a case of shading ranges - so the central range (say the range that 
contains the 40% to 60% of the results when sorted by value) is drawn dark, 
the ranges either side of that are drawn lighter, and so on until the 
outermost ranges (eg from the smallest to the 10% mark, and from the 90% mark 
to the largest) are drawn in light grey. Finally the median is drawn as a 
line - whose colour indicates packet loss.

The areas can be drawn three ways. Let's assume we have 11 values, representing 
the ping times for the fastest (t0), the 10th percentile (t1), through to the 
slowest (t10).

We can draw t0 to t10 in very light grey, then overlay t1 to t9 in less light 
grey, t2 to t8, t3 to t7, and finally overlay t4 to t6 in dark grey/black.
Or we can draw t0 to t1, stack t1 to t2, stack t2 to t3, then t3 to t4, t4 to 
t5, t5 to t6, t6 to t7, t7 to t8, t8 to t9, and finally t9 to t10, each in its 
own shade.
Or we can draw 0-t10 in light grey, then draw 0-t9 in darker grey, and so on 
(darkest for 0-t6 and 0-t5, then lighter again) until you've drawn 0-t1 in 
light grey. Then draw 0-t0 in white to erase the bit between the axis and the 
lowest value.

None of these is right or wrong - personally I'd do it the first way, which 
would be (from memory) something like:
CDEF:t010=t10,t0,-
CDEF:t19=t9,t1,-
...
AREA:t0#FF000000  - *
AREA:t010#E0E0E0::STACK
AREA:t1#FF000000  - *
AREA:t19#C0C0C0::STACK
...
* Note that I've used full transparency (the trailing 00 alpha) to draw 
nothing from the axis up to the bottom of each range.


Then you need to draw the line, and again you need to generate bands and then 
draw several overlays. Again there is more than one way:
You can draw the line, in each colour, only where that colour is needed; or 
you can draw the line in each colour, overlaying each colour on top of the 
previous one.
Eg, you could draw a red line all the way, then draw the light blue line only 
where packet loss is below 19/20, and so on until you draw the green line only 
where loss = 0. Or you can draw the red line only where loss is at least 
19/20, the light blue line only where loss is at least 10/20 but below 19/20, 
and so on. The line itself is drawn at the median value (t5 - bet you were 
wondering where that had gone !)
Something like this:
LINE:t5#FF0000
CDEF:l10=loss,19,LT,t5,UNKN,IF
LINE:l10#0000FF
...
Which means: draw t5 in red, then calculate l10, which equals t5 where loss is 
below 19 and is otherwise set to unknown, then draw that in blue. Where l10 is 
unknown, the line is not drawn and the red line shows through.
Repeat for the other steps.

So it's not actually all that hard to draw.
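
Putting the first method together from Perl with the RRDs module might look 
something like this - pings.rrd and the p0..p10 data source names are my 
assumptions about how the sorted ping times are stored, so treat it as a 
sketch rather than a drop-in:

    use strict;
    use warnings;
    use RRDs;

    my @args  = ('bands.png', '--start', 'end-3h');
    my @shade = ('E0E0E0', 'C0C0C0', 'A0A0A0', '707070', '303030');
    for my $i (0 .. 4) {
        my ($lo, $hi) = ($i, 10 - $i);          # band from t$lo to t$hi
        push @args,
            "DEF:t$lo=pings.rrd:p$lo:AVERAGE",
            "DEF:t$hi=pings.rrd:p$hi:AVERAGE",
            "CDEF:band$i=t$hi,t$lo,-",          # height of this band
            "AREA:t$lo#FF000000",               # invisible base (00 alpha)
            "AREA:band$i#$shade[$i]::STACK";    # shaded band on top
    }
    push @args, 'DEF:t5=pings.rrd:p5:AVERAGE',
                'LINE1:t5#FF0000:median';
    RRDs::graph(@args);
    die RRDs::error if RRDs::error;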



Alex van den Bogaerdt a...@vandenbogaerdt.nl wrote:

 I would generate a number of squares, showing either green, amber or red.

Just be aware that those colours are the ones that are affected by the most 
common form of colour blindness (red-green deficiency), which affects roughly 
1 in 12 males ! Many red/amber/green graphs are almost invisible to me (it 
depends on the area, the specific colours used, the display, ambient light, 
etc) and with some of them I really cannot see a change between the colour 
sections without blowing the screen up to make the areas larger.
A good example is the UniFi software from Ubiquiti, where on one page it shows 
a health bar for each access point - red where there is a lot of competing 
traffic and packet loss, green for good. The bars are thin, and it was a while 
before I even realised that there was a red section on some of them !



Re: [rrd-users] Getting an overview of many statistics...

2015-05-29 Thread Alex Aminoff


On Fri, 29 May 2015, Simon Hobson wrote:


Peter Valdemar Mørch pe...@morch.com wrote:


Looking at average and standard deviation is a possibility, but most of my users (and I) 
really have no good intuitive feeling for what standard deviation really 
means.


+1, I don't either


I recommend Full House, by Stephen Jay Gould, or other essays of his.

Summary of one of his best-known explanations: Why are there no more 
.400 hitters in baseball? Has the average quality of batters gone down, has 
the average quality of pitchers gone up, or has some change to the rules 
made batting harder in general? No, none of those. What has happened is 
that the variability of batting has shrunk, so there is less distance 
between the very top batters and the rest of the (major league, already a 
select group) batters.


Standard deviation is a measure of variability; I think of it as the range 
within which an observed value is about 68% likely to fall by random 
chance alone (as opposed to being different from the expected value because 
of some real cause).


If Babe Ruth bats .300 in 1915 and .320 in 1916 (I am making up these 
numbers), you would not think it was a big deal, because a .020 difference 
in batting average is pretty small compared to the standard deviation of 
player batting averages at the time. Whereas if David Ortiz bats .300 in 
2015 and .320 in 2016, you might be justified in thinking this is the 
result of something he is doing differently, because the .020 difference is 
big compared to the standard deviation of player batting averages in 2015.
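
To put a number on it, here is a throwaway Perl snippet in the same spirit 
as the Z value the attached script uses, with invented batting averages:

    # Invented league batting averages, purely for illustration.
    my @avgs = (0.300, 0.280, 0.310, 0.295, 0.305, 0.290);
    my $n    = @avgs;
    my $mean = 0;  $mean += $_ / $n for @avgs;
    my $var  = 0;  $var  += ($_ - $mean) ** 2 / ($n - 1) for @avgs;
    my $sd   = sqrt($var);                 # sample standard deviation
    my $z    = (0.320 - $mean) / $sd;      # SDs above the mean for .320
    printf "mean=%.3f sd=%.3f z=%.2f\n", $mean, $sd, $z;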


Anyway, I wanted to respond to the OP with a script I wrote, attached. The 
documentation is very scanty, but you never know when something will be 
useful to someone.


  - Alex Aminoff
BaseSpace.net
National Bureau of Economic Research (nber.org)

#!/usr/bin/perl

=head1 NAME

sdna -- Statistical Detection of Network Aberrance

=head1 SYNOPSIS

# as a cron job, every 10 minutes

sdna --query --read --quiet

# command line

sdna --grep nonzeroerrors switch1 switch2
sdna --read

=head1 OPTIONS

 --debug   debug
 --query   Queries all targets and saves collected data to RRD files
   in the RRD directory
 --grep shortcut Grep mode. Collections of include and exclude regexps are
   hard coded. Implies --query.
 --read    Read RRD files, calculate stats, display the most aberrant values
 --quiet   In read mode, display no output unless network aberrance is above
   a threshold.
 --config file Config file to read

=head1 DESCRIPTION

sdna is intended to run periodically from cron. It calls snmpbulkwalk
to collect all SNMP values from each target IP address, and stores
each in an RRD (Round-Robin Database) file.

sdna makes use of RRDtool's Aberrant Behavior Detection
functionality.  For each value, we derive an estimate of how aberrant
that value is, which is basically a Z value, or the number of standard
deviations out from our estimated mean for the value.

Then, we aggregate all the aberrances of all the values to get a grand
estimate of how unusual or aberrant the current state of the network
as a whole is. If greater than a threshold, we send an alert to an
operator.

sdna can also be used from the command line to produce a list of the
most deviant SNMP variables across the entire network. This might be
used to find which switch port a misbehaving device is on.

A feature of this system is that we try to be agnostic about what each
SNMP variable represents. It does not matter if it is bandwidth or
packet loss or the speed of the link - all we care about is how
different it is from its predicted value based on history. In practice
we cannot quite be pure about this; see $SKIP_PATTERN.

=head1 SEE ALSO

L<RRDs>, L<rrdtool(1)>, L<snmpbulkwalk(1)>

=head1 AUTHOR

Alex Aminoff, alex_amin...@alum.mit.edu

=head1 COPYRIGHT

Copyright 2013, shared by National Bureau of Economic Research
and Alexander Aminoff

=cut

use Getopt::Long;
use RRDs;

my %byshortcut = (
    nonzeroerrors => [ [ qr/Error/o, 1 ],
                       [ qr/: 0/o,   0 ],
                     ],
);
#my $DIR = '/homes/nber/aminoff/DUMPHERE/nbersnmpdata/';
my $DIR = '/var/db/sdna/';

my $SKIP_PATTERN = 
qr/(SNMPv2-SMI::mib-2|SNMPv2-SMI::transmission|SNMPv2-MIB::snmp|IP-MIB::ipNetToMediaIfIndex|66\.251\.7|198\.71\.[67])/o;

my $debug = 0;
my $grep = '';
my $eachthreshold = 2; # threshold Z score to be counted as aberrant
my $masterthreshold = .1; # threshold proportion of aberrant tests for alarm
my ($query,$read,$quiet) = (0,0,0);
my $config = '';
my $nofork = 0;

GetOptions('query'    => \$query,
           'read'     => \$read,
           'grep=s'   => \$grep,
           'debug+'   => \$debug,
           'quiet'    => \$quiet,
           'nofork'   => \$nofork,
           'config=s' => \$config,
);

if (! $query && ! $read && ! $grep) {
# default operation
$read=1;
}

if ($debug) {
    print "After cmd line args:\n debug:$debug query:$query read:$read quiet:$quiet grep:$grep\n";
}

if ($config) {