On Sunday 29 August 2004 HH:58:18, Jenda Krynicky wrote:
> From: Philipp Traeder <[EMAIL PROTECTED]>
>
> > You're right - the problem I'm trying to solve is quite restricted -
> > and I'm very thankful for this ;-) Basically, I'm trying to write an
> > application that "recognizes" log file formats, so that the following
> > lines are identified as several manifestations of the same log
> > message:
> >
> > could not delete user 3248234
> > could not delete user 2348723
> >
> > or even
> >
> > failed to connect to primary server
> > failed to connect to secondary server
> >
> > What I would like to see is a count of how many "manifestations" of
> > each log message are being thrown, independently of the actual data
> > they might contain. Since I do not want to hardcode the log messages
> > into my application, I would like to generate regexes on the fly as
> > they are needed.
>
> Well and how are you going to tell the program which messages to take
> as the same?
> Do you plan to teach the app as it reads the lines? Do you want it to
> ask which group is a line that doesn't match any of the regexps so
> far and have the regexp modified on the fly to match that line as
> well?
>
> Or what do you want to do?
>
> IMHO it might be best to use handmade regexps, just don't have them
> built into the application, but read from a config file. That is for
> each type of logs you'd have a file with something like this:
>
> delete_user=^could not delete user \d+
> connect=^failed to connect to (?:primary|secondary) server
> ...
>
> read the file, compile the regexps with qr// and have the application
> try to match them and have the messages counted in the first group
> whose regexp matches.
>
>
> Do I make sense or am I babbling nonsense?

You're making perfect sense. The problem is not as trivial as I originally
thought, but I think it's not that bad as long as you don't require 100%
precision.

In a Perl script I wrote some time ago, I group log messages by comparing
them word by word, using the String::Compare module like this:

compare($message1, $message2, word_by_word => 5);

If I read the module's code correctly, the strings are split on whitespace
and then compared char by char. With this approach, I get a high similarity
even if the differing parts of the strings do not have the same length, as
in

failed to connect to primary server
failed to connect to secondary server
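
To make the word-by-word idea concrete, here is a minimal sketch in Python
(chosen only for brevity; it is neither Perl nor my Java port). It is not
String::Compare's algorithm: it simply compares whitespace-separated words
position by position, and the name word_similarity is made up for this
illustration.

```python
def word_similarity(a, b):
    """Return the fraction of word positions at which a and b agree.

    Hypothetical helper: words are compared position by position, so a
    differing word counts as one mismatch no matter how long it is.
    """
    words_a, words_b = a.split(), b.split()
    if not words_a and not words_b:
        return 1.0  # two empty messages are trivially identical
    same = sum(1 for x, y in zip(words_a, words_b) if x == y)
    # Dividing by the longer length penalises extra or missing words.
    return same / max(len(words_a), len(words_b))


# 5 of the 6 words agree, although "primary" and "secondary" differ in length.
print(word_similarity("failed to connect to primary server",
                      "failed to connect to secondary server"))
```

Because the score is computed per word, the length of the differing word
does not influence it, which is the property described above.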

What I did now was to extend String::Compare so that it records the
differing parts of the strings in a string array for each string and returns
a "wildcarded" version of the string, i.e. a version that replaces each
character that is not identical in both strings with a wildcard string.
(Actually, I did not extend String::Compare itself but ported it to Java,
because I'm writing the application in Java, but the idea should be the
same.)
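
A rough sketch of that step, again in Python and simplified to whole words
rather than single characters. The function name wildcard_pair and its return
shape are assumptions made for this example, not the interface of
String::Compare or of my port.

```python
def wildcard_pair(a, b, token="*"):
    """Compare two messages word by word.

    Returns (pattern, diffs_a, diffs_b), where pattern has `token` in place
    of every differing word and diffs_a / diffs_b hold the replaced words.
    Returns None if the messages do not have the same number of words
    (a simplification of this sketch).
    """
    words_a, words_b = a.split(), b.split()
    if len(words_a) != len(words_b):
        return None
    pattern, diffs_a, diffs_b = [], [], []
    for x, y in zip(words_a, words_b):
        if x == y:
            pattern.append(x)
        else:
            pattern.append(token)
            diffs_a.append(x)
            diffs_b.append(y)
    return " ".join(pattern), diffs_a, diffs_b


result = wildcard_pair("failed to connect to primary server",
                       "failed to connect to secondary server")
print(result)
```

For the two example messages this yields the pattern
"failed to connect to * server" together with the extracted words "primary"
and "secondary".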

Currently, I'm not using the regexp generated in this way for matching new
messages, because I ran into a kind of deadlock: what should I do when I get
a message for which I do not have a matching regexp yet? Since I have only
one occurrence of this message so far, I cannot detect a pattern, and thus I
cannot generate a regexp. Therefore, I have to compare all subsequent
messages against the real messages, using the method described above, not
against a wildcarded version.

Anyway - if you choose the wildcard wisely, I think you should be able to
generate a regexp that is certainly not as good as one written by a human,
but probably good enough (e.g. you could take

(.*?)

as the wildcard for each differing "word").
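
As a hedged illustration of how a wildcarded message could become a regexp,
the following Python sketch escapes the literal parts and puts (.*?) where
the wildcard was. The helper name pattern_to_regex and the "*" placeholder
are assumptions of this example. A message that itself contains "*" would
break it, which is one reason to choose the wildcard carefully.

```python
import re


def pattern_to_regex(pattern, token="*"):
    """Compile a wildcarded message into an anchored regexp.

    Literal text is escaped so that characters such as "." or "/" match
    themselves; every occurrence of `token` becomes a lazy capture group.
    """
    literal_parts = [re.escape(part) for part in pattern.split(token)]
    return re.compile("^" + "(.*?)".join(literal_parts) + "$")


rx = pattern_to_regex("failed to connect to * server")
match = rx.match("failed to connect to tertiary server")
print(match.groups() if match else None)
```

The anchors matter here: without the trailing "$", a lazy group at the end
of the pattern would be free to match the empty string.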

At the moment, this should be enough to solve my problem - I'm already using
the word-by-word string comparison successfully, and it looks as if the
ported/extended Java version of String::Compare will do what I need.

Nevertheless, I could imagine building better regexps by comparing the data
extracted from the messages (since I need to extract the data anyway to use
it later, this is a very likely option). Let's say I've got the following
log messages taken from a web application:

30/08/2004 23:25:01 processed request for a.html - took 35 ms
30/08/2004 23:25:05 processed request for ab.html - took 42 ms
30/08/2004 23:25:05 processed request for a.html - took 37 ms

My application compares the messages, detects that they are very similar, and
creates the following pattern (assuming that the wildcard char is an asterisk
and that a multi-character difference is replaced by one wildcard char):

30/08/2004 23:25:* processed request for *.html - took * ms

The differing data it extracts for the three lines is this:

01  a   35
05  ab  42
05  a   37

Going over the individual "columns" of data, the application could try to
match some pre-declared data formats, i.e. it could check whether all values
in a column match certain patterns like "\d+", "[a-zA-Z]+" etc. If it finds
a matching format, it could adapt the regexp so that it matches in a more
fine-grained way.
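
A small Python sketch of that column check, using the three data rows from
above. The list of candidate formats and the name refine_columns are
assumptions for illustration. A column that fits none of the formats keeps
the generic ".*?".

```python
import re

# Candidate formats, tried in order; purely illustrative.
FORMATS = [r"\d+", r"[a-zA-Z]+"]


def refine_columns(rows, fallback=".*?"):
    """For each column of extracted values, pick the first format that
    matches every value in that column, or `fallback` if none does."""
    refined = []
    for column in zip(*rows):  # transpose rows into columns
        for fmt in FORMATS:
            if all(re.fullmatch(fmt, value) for value in column):
                refined.append(fmt)
                break
        else:
            refined.append(fallback)
    return refined


rows = [("01", "a", "35"),
        ("05", "ab", "42"),
        ("05", "a", "37")]
print(refine_columns(rows))
```

For these rows, the first and third columns come out as "\d+" and the second
as "[a-zA-Z]+", so the three wildcards in the pattern could be replaced by
(\d+), ([a-zA-Z]+) and (\d+) respectively.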

You could object (and if I understood your mail correctly, you already did)
that the application created a wrong pattern by taking the date and time
(including the minutes, but not the seconds) as fixed - a log message that
arrives a minute later would no longer fit the regexp. This is a problem if
I'm trying to use the regexp to match the messages, but not if I'm comparing
the messages as strings again (as described above).

Writing this, I think you're right - my problem is probably not solvable by
generating regexps on the fly, but only (hopefully) by comparing strings on a
more brute-force level. It might be an option to use regexps in order to
speed up the process, but if you do not find a matching regexp, you probably
need to go back to comparing strings again...
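
One way this combination could look, sketched in Python under several
assumptions: the similarity threshold of 0.7, the word-level regexp built
from (\S+), and all function names are invented for this example. Known
regexps are tried first; a message that matches none of them is compared
against the raw messages seen only once so far, and a new regexp is created
as soon as a similar pair is found.

```python
import re
from collections import Counter

THRESHOLD = 0.7  # assumed cut-off for "similar enough"


def similarity(a, b):
    """Fraction of word positions at which the two messages agree."""
    wa, wb = a.split(), b.split()
    same = sum(1 for x, y in zip(wa, wb) if x == y)
    return same / max(len(wa), len(wb), 1)


def build_regex(a, b):
    """Regexp with a capture group wherever the two messages differ."""
    parts = [re.escape(x) if x == y else r"(\S+)"
             for x, y in zip(a.split(), b.split())]
    return re.compile(r"^" + r"\s+".join(parts) + r"$")


def count_messages(lines):
    patterns = []       # regexps generated so far
    counts = Counter()  # regexp source -> number of messages
    singles = []        # raw messages without a partner yet
    for line in lines:
        for rx in patterns:  # fast path: an existing regexp matches
            if rx.match(line):
                counts[rx.pattern] += 1
                break
        else:                # slow path: compare against raw messages
            for other in singles:
                if (len(other.split()) == len(line.split())
                        and similarity(other, line) >= THRESHOLD):
                    rx = build_regex(other, line)
                    patterns.append(rx)
                    counts[rx.pattern] += 2  # both messages belong here
                    singles.remove(other)
                    break
            else:
                singles.append(line)
    return dict(counts), singles


counts, singles = count_messages([
    "could not delete user 3248234",
    "failed to connect to primary server",
    "could not delete user 2348723",
    "failed to connect to secondary server",
    "could not delete user 1",
    "disk full",
])
print(counts, singles)
```

Note that this sketch treats every word that happens to be identical in the
first two messages as fixed, so it is open to the same date problem as
discussed above.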

I've not finished the application yet, so I can't say if all of this is going
to work, but I'm quite optimistic at the moment. With a bit of luck, I can
show you a working version in a few weeks (FWIW: the application I'm talking
about will be a log4j server application - similar to Chainsaw, but built for
the application operators as opposed to the developers).

Thank you for your insightful questions and suggestions - I very much
appreciate the opportunity to discuss these problems before running into too
many walls. :-)

Philipp