Re: Regular expression generator/creator

Philipp Traeder Mon, 30 Aug 2004 14:55:05 -0700

On Sunday 29 August 2004 HH:58:18, Jenda Krynicky wrote:
> From: Philipp Traeder <[EMAIL PROTECTED]>
>
> > You're right - the problem I'm trying to solve is quite restricted -
> > and I'm very thankful for this ;-) Basically, I'm trying to write an
> > application that "recognizes" log file formats, so that the following
> > lines are identified as several manifestations of the same log
> > message:
> >
> >   could not delete user 3248234
> >   could not delete user 2348723
> >
> > or even
> >
> >   failed to connect to primary server
> >   failed to connect to secondary server
> >
> > What I would like to see is a count of how many "manifestations" of
> > each log message are being thrown, independently of the actual data
> > they might contain. Since I do not want to hardcode the log messages
> > into my application, I would like to generate regexes on the fly as
> > they are needed.
>
> Well and how are you going to tell the program which messages to take
> as the same?
> Do you plan to teach the app as it reads the lines? Do you want it to
> ask which group is a line that doesn't match any of the regexps so
> far and have the regexp modified on the fly to match that line as
> well?
>
> Or what do you want to do?
>
> IMHO it might be best to use handmade regexps, just don't have them
> built into the application, but read from a config file. That is for
> each type of logs you'd have a file with something like this:
>
>       delete_user=^could not delete user \d+
>       connect=^failed to connect to (?:primary|secondary) server
>       ...
>
> read the file, compile the regexps with qr// and have the application
> try to match them and have the messages counted in the first group
> whose regexp matches.
>
>
> Do I make sense or am I babbling nonsense?


You're making perfect sense - the problem is not as trivial as I thought 
originally, but I think it's not that bad as long as you don't require a 
precision of 100%.

In a Perl script I wrote some time ago, I group log messages by comparing 
them word by word, using the String::Compare module like this:

        compare($message1, $message2, word_by_word => 5);

If I read the module's code correctly, the strings are split up by whitespace 
and then compared char by char. Using this approach, I get a high similarity 
even if the differing parts of the strings do not have the same length, like 
in
        failed to connect to primary server
        failed to connect to secondary server
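
A minimal sketch in Java of such a word-by-word comparison, assuming words
are paired by position and each pair is scored by the fraction of character
positions that agree. This is NOT String::Compare's actual scoring, only an
illustration of the idea; the class and method names are made up.

```java
final class WordSimilarity {

    /** Fraction of character positions at which two words agree (0.0 to 1.0). */
    static double wordScore(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 1.0; // two empty words count as identical
        }
        int shortest = Math.min(a.length(), b.length());
        int same = 0;
        for (int i = 0; i < shortest; i++) {
            if (a.charAt(i) == b.charAt(i)) {
                same++;
            }
        }
        return (double) same / longest;
    }

    /**
     * Splits both messages on whitespace, pairs the words by position and
     * averages the per-word scores. Words without a partner score 0.
     */
    static double similarity(String m1, String m2) {
        String[] w1 = m1.trim().split("\\s+");
        String[] w2 = m2.trim().split("\\s+");
        int longest = Math.max(w1.length, w2.length);
        int shortest = Math.min(w1.length, w2.length);
        double sum = 0.0;
        for (int i = 0; i < shortest; i++) {
            sum += wordScore(w1[i], w2[i]);
        }
        return sum / longest;
    }
}
```

With this scoring, the two "failed to connect" lines above agree on five of
six words and score 5/6, although "primary" and "secondary" differ in
length.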

What I have done now is to extend String::Compare so that it records the 
differing parts of each string in a string array. (Actually, I did not 
extend String::Compare itself but ported it to Java, because I'm writing the 
application in Java, but the idea should be the same.) It also returns a 
"wildcarded" version of the string, i.e. a version in which each character 
that is not identical in both strings is replaced with a wildcard string.

Currently, I'm not using the regexp that is generated in this way for matching 
new messages, because I ran into a kind of deadlock: what should I do when I 
get a message for which I do not have a matching regexp yet? Since I have 
only one occurrence of this message so far, I cannot detect a pattern, and 
thus I cannot generate a regexp. Therefore, I have to compare all messages 
that follow against the real messages, using the method described above, and 
not against a wildcarded version.
Anyway - if you choose the wildcard wisely, I think you should be able to 
generate a regexp that is surely not as good as one written by a human, but 
probably good enough (e.g. you could take 
  (.*?)
as the wildcard for each differing "word").
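
A rough sketch in Java of how such a regexp could be built from two similar
messages, assuming both have the same number of whitespace-separated words.
The class and method names are hypothetical; identical words are quoted
literally and each differing word becomes (.*?).

```java
import java.util.Optional;
import java.util.regex.Pattern;

final class WildcardRegex {

    /**
     * Builds an anchored regexp from two messages. Returns an empty Optional
     * when the word counts differ, because this sketch pairs words by position.
     */
    static Optional<Pattern> fromPair(String m1, String m2) {
        String[] w1 = m1.trim().split("\\s+");
        String[] w2 = m2.trim().split("\\s+");
        if (w1.length != w2.length) {
            return Optional.empty();
        }
        StringBuilder regex = new StringBuilder("^");
        for (int i = 0; i < w1.length; i++) {
            if (i > 0) {
                regex.append("\\s+"); // words are separated by whitespace
            }
            if (w1[i].equals(w2[i])) {
                regex.append(Pattern.quote(w1[i])); // fixed text, taken literally
            } else {
                regex.append("(.*?)"); // differing word, captured as data
            }
        }
        regex.append('$');
        return Optional.of(Pattern.compile(regex.toString()));
    }
}
```

Note that (.*?) can also span several words when the rest of the pattern
allows it; using (\S+) instead would restrict the wildcard to a single word.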

At the moment, this should be enough to solve my problem - I'm already using 
the word-by-word string comparison successfully, and it looks as if the 
ported/extended Java version of String::Compare would do what I need. 
Nevertheless I could imagine that you could build better regexps by comparing 
the data that you extracted from the message (since I need to extract the 
data anyway to use them later, this is a very likely option). Let's say I've 
got the following log messages taken from a web application:

  30/08/2004 23:25:01 processed request for a.html - took 35 ms
  30/08/2004 23:25:05 processed request for ab.html - took 42 ms
  30/08/2004 23:25:05 processed request for a.html - took 37 ms

My application compares the messages, detects that they are very similar, and 
creates the following pattern (assuming that the wildcard char is an asterisk 
and that a multi-character difference is replaced by one wildcard char):

  30/08/2004 23:25:* processed request for *.html - took * ms

The differing data it extracts for the three lines is this:

  01  a   35
  05  ab  42
  05  a   37
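
As an illustration, the asterisk pattern above could be turned into a regexp
and used to pull out these columns roughly as follows. This is only a sketch:
the names are made up, and it assumes that the messages themselves never
contain a literal asterisk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AsteriskPattern {

    /** Converts a pattern like "took * ms" into an anchored regexp with one group per asterisk. */
    static Pattern toRegex(String wildcarded) {
        String[] literals = wildcarded.split("\\*", -1); // keep trailing empty parts
        StringBuilder regex = new StringBuilder("^");
        for (int i = 0; i < literals.length; i++) {
            if (i > 0) {
                regex.append("(.*?)"); // one capturing group per asterisk
            }
            if (!literals[i].isEmpty()) {
                regex.append(Pattern.quote(literals[i]));
            }
        }
        regex.append('$');
        return Pattern.compile(regex.toString());
    }

    /** Returns the captured data of a matching message, or an empty list if it does not match. */
    static List<String> extract(Pattern pattern, String message) {
        List<String> columns = new ArrayList<>();
        Matcher m = pattern.matcher(message);
        if (m.matches()) {
            for (int g = 1; g <= m.groupCount(); g++) {
                columns.add(m.group(g));
            }
        }
        return columns;
    }
}
```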

Going over the individual "columns" of data, the application could try to 
match some pre-declared data formats, i.e. it could check whether all values 
in a column match a certain pattern like "\d+", "[a-zA-Z]+" etc. If it finds a 
matching format, it could replace the generic wildcard in the regexp with 
that more specific pattern.
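
A small Java sketch of that column check, assuming a fixed, ordered list of
pre-declared formats (here only \d+ and [a-zA-Z]+) and a fallback to the
generic wildcard. The names and the choice of formats are illustrative only.

```java
import java.util.List;

final class ColumnFormats {

    /** Pre-declared formats, tried in this order. */
    private static final String[] FORMATS = { "\\d+", "[a-zA-Z]+" };

    /** Generic wildcard used when no pre-declared format fits every value. */
    static final String FALLBACK = ".*?";

    /** Returns the first format that matches all values of a column, else the fallback. */
    static String narrowest(List<String> values) {
        if (values.isEmpty()) {
            return FALLBACK; // nothing to learn from an empty column
        }
        for (String format : FORMATS) {
            boolean allMatch = true;
            for (String value : values) {
                if (!value.matches(format)) { // String.matches tests the whole value
                    allMatch = false;
                    break;
                }
            }
            if (allMatch) {
                return format;
            }
        }
        return FALLBACK;
    }
}
```

For the three columns above this picks \d+ for the seconds and the
durations, and [a-zA-Z]+ for the page names.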

You could object (and if I understood your mail correctly, you already did) 
that the application created a wrong pattern by taking the date (including 
the minutes, but not the seconds) as fixed - a log message that arrives a 
minute later would not fit the regexp anymore. This is a problem if I'm 
trying to use the regexp to match the messages, but not if I'm comparing the 
messages as strings again (as described above).

Writing this, I think you're right - my problem is probably not solvable by 
generating regexps on the fly, but only (hopefully) by comparing strings on a 
more brute-force level. It might be an option to try to use regexps in order 
to speed up the process, but if you do not find a matching regexp, you 
probably need to go back to comparing strings again...
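
One way this combination could look in Java, as a sketch only: known regexps
are tried first, and a message that matches none of them is compared against
the first message of each group. The similarity function, the regexp builder
and the threshold are passed in and are assumptions, as are all names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.ToDoubleBiFunction;
import java.util.regex.Pattern;

final class MessageGrouper {

    static final class Group {
        final String firstMessage;
        Pattern pattern; // stays null until a second, similar message arrives
        int count = 1;

        Group(String firstMessage) {
            this.firstMessage = firstMessage;
        }
    }

    private final List<Group> groups = new ArrayList<>();
    private final ToDoubleBiFunction<String, String> similarity;
    private final BiFunction<String, String, Pattern> patternBuilder;
    private final double threshold;

    MessageGrouper(ToDoubleBiFunction<String, String> similarity,
                   BiFunction<String, String, Pattern> patternBuilder,
                   double threshold) {
        this.similarity = similarity;
        this.patternBuilder = patternBuilder;
        this.threshold = threshold;
    }

    /** Counts the message in an existing group or opens a new one. */
    Group add(String message) {
        // Fast path: a regexp generated earlier already covers this message.
        for (Group g : groups) {
            if (g.pattern != null && g.pattern.matcher(message).matches()) {
                g.count++;
                return g;
            }
        }
        // Slow path: compare against the real first message of each group.
        for (Group g : groups) {
            if (similarity.applyAsDouble(g.firstMessage, message) >= threshold) {
                g.count++;
                if (g.pattern == null) {
                    g.pattern = patternBuilder.apply(g.firstMessage, message);
                }
                return g;
            }
        }
        // No regexp and no similar message yet: start a new group.
        Group fresh = new Group(message);
        groups.add(fresh);
        return fresh;
    }

    List<Group> groups() {
        return groups;
    }
}
```

A group holding a single message simply has no regexp yet, so it is only
reachable through the string comparison until a second message arrives.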

I've not finished the application yet, so I can't say if all of this is going 
to work, but I'm quite optimistic at the moment. With a bit of luck, I can 
show you a working version in a few weeks (FWIW: The application I'm talking 
about will be a log4j server application - similar to chainsaw, but built for 
the application operators as opposed to the developers).

Thank you for your insightful questions and suggestions - I very much 
appreciate the opportunity to discuss these problems before running into too 
many walls. :-)

Philipp

