Re: Log analysis server suggestions? [long]

2006-02-20 Thread Ashley Moran
On Thursday 16 February 2006 15:30, Chuck Swiger wrote:
 I'm not sure who the original poster was, but whoever is interested in this
 topic might benefit by reading a thread from the firewall-wizards mailing
 list:

[snip]

Cheers, that was very useful - I've put it into our company Wiki so it can be 
ignored by everyone :)

I like the 3-stage processing:
 Simply design your analysis as an always 3-stage process consisting of:
 - weeding out and counting instances of uninteresting events
 - selecting, parsing sub-fields of, and processing interesting events
 - retaining events that fell through the first two steps as unusual

That solves the problem of missing log messages you didn't anticipate, although 
it adds a lot to the initial server configuration.

Ashley


Re: Log analysis server suggestions? [long]

2006-02-16 Thread Chuck Swiger
Ashley Moran [EMAIL PROTECTED] wrote:
[ ... ]

I'm not sure who the original poster was, but whoever is interested in this
topic might benefit by reading a thread from the firewall-wizards mailing list:

 Original Message 
Subject: [fw-wiz] parsing logs ultra-fast inline
Date: Wed, 01 Feb 2006 16:03:38 -0500
From: Marcus J. Ranum [EMAIL PROTECTED]
To: firewall-wizards@honor.icsalabs.com

OK, based on some offline discussion with a few people, about
doing large amounts of system log processing inline at high
speeds, I thought I'd post a few code fragments and some ideas
that have worked for me in the past.

First off, if you want to handle truly ginormous amounts of log
data quickly, you need to build a structure wherein you're making
decisions quickly at a broad level, then drilling down based on
the results of the decision. This allows you to parallelize infinitely
because all you do is make the first branch in your decision tree
stripe across all your analysis engines. So, hypothetically,
let's say we were handling typical UNIX syslogs at a ginormous
volume: we might have one engine (CPU/process or even a
separate box/bus/backplane/CPU/drive array) responsible for
(sendmail | named) and another one responsible for (apache | imapd)
etc. If you put some simple counters in your analysis routines
(hits versus misses) you can load balance your first tree-branch
appropriately using a flat percentage. Also, remember, if you
standardize your processing, it doesn't matter where it happens;
it can happen at the edge/source or back in a central location
or any combination of the two. Simply design your analysis as
an always 3-stage process consisting of:
- weeding out and counting instances of uninteresting events
- selecting, parsing sub-fields of, and processing interesting events
- retaining events that fell through the first two steps as unusual
The results of these 3 stages are
- a set of event-IDs and counts
- a set of event-IDs and interesting fields and counts
- residual data in raw form
Back-haul the event-IDs and counts and fields and graph them or stuff
them into a database, and bring the residual data to the attention of a human
being.
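
To make that concrete, here is a minimal sketch of the 3-stage split in
Python; the UNINTERESTING and INTERESTING patterns are made-up placeholders
you would fill in per site and per daemon, not anything from the original
post:

import re
from collections import Counter

# Hypothetical patterns -- in practice these are built up per site.
UNINTERESTING = {
    "cron-session": re.compile(r"CRON\[\d+\]: pam_unix\(cron:session\)"),
}
INTERESTING = {
    "sshd-accepted": re.compile(r"sshd\[\d+\]: Accepted (\S+) for (\S+) from (\S+)"),
}

def analyse(lines):
    counts = Counter()      # stage 1: event-IDs and counts
    parsed = []             # stage 2: event-IDs plus interesting fields
    residual = []           # stage 3: raw lines that nothing matched

    for line in lines:
        for event_id, pat in UNINTERESTING.items():
            if pat.search(line):
                counts[event_id] += 1
                break
        else:
            for event_id, pat in INTERESTING.items():
                m = pat.search(line)
                if m:
                    counts[event_id] += 1
                    parsed.append((event_id, m.groups()))
                    break
            else:
                residual.append(line)   # unusual: bring this to a human

    return counts, parsed, residual

counts and parsed are what you graph or stuff into a database; residual is
what a human needs to look at.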

I suppose if you needed to you could implement a log load
balancer in the form of a box that had N interfaces that
collected a fat stream of log data, ran a simple program
that sorted the stream into 1/N sub-streams and forwarded
them to backend engines for more involved processing. You
could scale your logging architecture to very very large
loads this way. It works for Google and it'd work for you, too.
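
As a rough illustration (again, not anything from the original post), such a
splitter could be as small as a UDP relay that hashes on the sending host so
each backend always sees the same 1/N slice of the stream; the addresses and
ports below are invented:

import socket
import zlib

# Hypothetical backend analysis engines doing the heavier per-daemon parsing.
BACKENDS = [("10.0.0.11", 514), ("10.0.0.12", 514), ("10.0.0.13", 514)]

def run_splitter(listen_port=5140):
    recv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    recv.bind(("0.0.0.0", listen_port))
    send = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        data, (src_host, _port) = recv.recvfrom(65535)
        # A stable hash of the source host picks one of the N sub-streams.
        idx = zlib.crc32(src_host.encode()) % len(BACKENDS)
        send.sendto(data, BACKENDS[idx])

if __name__ == "__main__":
    run_splitter()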

The first phase of processing is to stripe across engines if
necessary, then inside each engine you stripe the processing
into functional sub-parsers that deal with a given message
format. The implementation is language-irrelevant though your
language choice will affect performance. Typically you write a
main loop that looks like:
while ( get a message ) {
    if(message is a sendmail message)
        parse sendmail message
    if(message is an imap message)
        parse imap message
    ...
}

Once your system has run on a sample dataset you will be able
to determine which messages come most frequently and you can
put that test at the top of the loop. This can result in an enormous
performance boost.
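
A runnable rendering of that loop in Python, with made-up match patterns and
stub sub-parsers, plus the hit/miss counters mentioned earlier so you can see
which tests deserve to sit at the top (and how to balance a first-level
stripe):

import re
import sys
from collections import Counter

hits = Counter()

def parse_sendmail(line):       # stub sub-parsers, for illustration only
    pass

def parse_imap(line):
    pass

# Order matters: after a trial run, re-order this list by hit count so the
# most frequent message types are tested first.
PARSERS = [
    ("sendmail", re.compile(r" sendmail\["), parse_sendmail),
    ("imapd",    re.compile(r" imapd\["),    parse_imap),
]

for line in sys.stdin:                  # while ( get a message )
    for name, pat, handler in PARSERS:
        if pat.search(line):            # if (message is a ... message)
            hits[name] += 1
            handler(line)               # parse ... message
            break
    else:
        hits["unmatched"] += 1          # nothing claimed it

print(hits.most_common(), file=sys.stderr)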

Each sub-parse routine follows the same structure as the
main loop, performing a sorted series of checks to sub-parse
the fields of the message-specific formats, e.g.:

parse sendmail message( ) {
    if(message is a stat=sent message) {
        pull out recipient;
        pull out sender;
        increment message sent count;
        add message size to sender score;
        done
    }
    if(message is a stat=retry message) {
        ignore; // done
    }
    if(message is a whatever) {
        whatever;
        done
    }

    // if we fell through to here we have a new message structure
    // we have never seen before!!
    vector message to interesting folder;
}
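
Filled in with rough regular expressions, the same structure might look like
the Python below; note that real sendmail log formats vary by version and
configuration (and put the sender/size and recipient/status on separate
lines), so the patterns are illustrative assumptions only:

import re

sent_count = 0
sender_score = {}       # bytes attributed to each sender
interesting = []        # messages we have never seen before

FROM_RE  = re.compile(r"from=<(?P<sender>[^>]*)>.*size=(?P<size>\d+)")
SENT_RE  = re.compile(r"to=<(?P<rcpt>[^>]*)>.*stat=Sent")
RETRY_RE = re.compile(r"stat=Deferred")

def parse_sendmail_message(line):
    global sent_count
    m = SENT_RE.search(line)
    if m:                               # stat=Sent: pull out the recipient
        recipient = m.group("rcpt")     # available for per-recipient stats
        sent_count += 1                 # increment message sent count
        return
    m = FROM_RE.search(line)
    if m:                               # from= line: sender and message size
        sender = m.group("sender")      # pull out the sender
        sender_score[sender] = sender_score.get(sender, 0) + int(m.group("size"))
        return
    if RETRY_RE.search(line):           # retry/deferred: ignore
        return
    # If we fell through to here we have a message structure
    # we have never seen before!
    interesting.append(line)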

Variant messages are a massive pain in the butt; you need to decide
whether to deal with variants as separate cases or to make the
sub-parser smarter in order to deal with them. This is one of the
reasons I keep saying that system log analysis is highly site
specific! If your site doesn't get system logs from a DECStation
5000 running ULTRIX 3.1D then you don't need to parse that data.
Indeed, if you build your parse tree around that notion then, if you
suddenly start getting ULTRIX format log records in your data
stream, that'd be - shall we say - interesting and you want to know
about it. I remember when I was looking at some log data at one
site (Hi Abe!) we found one log message that was about 400K long
on a single line. It appeared that a fairly crucial piece of software
decided to spew a chunk of its memory in a log message, for no
apparent reason.