Earlier today I mentioned that I have a number of portmanteau rules that
fire on misspelt words in body text, etc. These are all structured along
the lines of:

describe PORTMANTEAU Example of a somewhat unwieldy rule
body     __PM1       /(word1|worrd2|wooord3)/i
body     __PM2       /(mispel|misspelll|mspell)/i
....
meta     PORTMANTEAU (__PM1||__PM2||...)
score    PORTMANTEAU 0.01

Yes, they could be just one horrendously long rule but, until SA allows
multi-line regexes, that would be both unreadable and a nightmare to
maintain. As it is, my current approach gets unwieldy once a rule
uses more than 6-8 subsidiary rules. So, it occurred to me that
another way of handling this would be to export this sort of search to a
support server written in C and using the PCRE library to handle
matching.

The server would load and compile collections of regexes from a
configuration file or database at start-up or when signalled and would
perform all comparisons against in-memory tables. Each distinct regex
collection would be named. A plugin would provide SA's interface to the
server, passing it the collection name and text to be searched and
returning a hit/miss result. Rules would use the eval() method to talk
to the plugin, telling it what part of the message to test and which
regex collection(s) to apply. The server should save some CPU cycles
since, although it would need to work sequentially through a regex
collection, it can stop as soon as it has a hit - something I think may
not be possible with my portmanteau rule design.
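
To make the matching loop concrete, here is a minimal sketch of one named
collection being compiled once and then searched with an early exit on the
first hit. It is an illustration under assumptions, not a design: it uses
POSIX <regex.h> as a stand-in for the PCRE library so that it builds without
extra dependencies, and the names struct collection, collection_add,
collection_match, collection_free and MAX_PATTERNS are all hypothetical.
Config loading, signal handling and the socket interface to the plugin are
left out.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

#define MAX_PATTERNS 64   /* arbitrary cap chosen for this sketch */

/* One named collection of precompiled patterns (hypothetical layout). */
struct collection {
    const char *name;                    /* name a rule would ask for */
    regex_t     compiled[MAX_PATTERNS];  /* patterns compiled at load time */
    size_t      count;                   /* number of valid entries */
};

/* Compile one pattern into the collection.
 * Returns 0 on success, -1 if the collection is full or the pattern
 * does not compile. */
static int collection_add(struct collection *c, const char *pattern)
{
    assert(c != NULL && pattern != NULL);
    if (c->count >= MAX_PATTERNS)
        return -1;
    /* REG_ICASE mirrors the /i flag on the SA rules; REG_NOSUB because
     * only a hit/miss answer is needed, not match offsets. */
    if (regcomp(&c->compiled[c->count], pattern,
                REG_EXTENDED | REG_ICASE | REG_NOSUB) != 0)
        return -1;
    c->count++;
    return 0;
}

/* Work sequentially through the collection and stop at the first hit.
 * Returns 1 for a hit, 0 for a miss. */
static int collection_match(const struct collection *c, const char *text)
{
    assert(c != NULL && text != NULL);
    for (size_t i = 0; i < c->count; i++) {
        if (regexec(&c->compiled[i], text, 0, NULL, 0) == 0)
            return 1;   /* early exit: remaining patterns are skipped */
    }
    return 0;
}

/* Release the compiled patterns, e.g. before reloading the config. */
static void collection_free(struct collection *c)
{
    assert(c != NULL);
    for (size_t i = 0; i < c->count; i++)
        regfree(&c->compiled[i]);
    c->count = 0;
}
```

A caller would zero-initialise a struct collection, call collection_add()
once per pattern at start-up, and then call collection_match() with each
chunk of message text received from the plugin. A PCRE version would have
the same shape, with the library's own compile and exec calls in place of
regcomp() and regexec().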

The server should be fairly easy to write since its logic is simple and,
even if single threaded, could still be a shared resource used by
multiple SA copies. It should be fairly small since most regex
collections are unlikely to exceed 1K in text form: its major memory use
would be the result of buffering large message bodies.

So, is this idea worth pursuing?

- is this something anybody apart from myself would use?
- apart from that, is it a stupid idea and, if so, what's wrong with it?
- are there obvious performance problems I haven't spotted:
  - am I right about all regexes in a portmanteau rule being applied
    to every message?
  - is the overhead of sending the body to the server a show-stopper?


Martin

