On Thu, Mar 11, 2010 at 08:11:37PM +0000, Martin Gregorie wrote:
> Earlier today I mentioned that I have a number of portmanteau rules that
> fire on misspelt words in body text, etc. These are all structured along
> the lines of:
> 
> describe PORTMANTEAU Example of a somewhat unwieldy rule
> body     __PM1       /(word1|worrd2|wooord3)/i
> body     __PM2       /(mispel|misspelll|mspell)/i
> ....
> meta     PORTMANTEAU (__PM1||__PM2||...)
> score    PORTMANTEAU 0.01
> 
> Yes, they could be just one horrendously long rule, but until SA allows
> multi-line regexes that would be both unreadable and a nightmare to
> maintain. As it is, my current approach is starting to get unwieldy once
> it uses more than 6-8 subsidiary rules. So, it occurred to me that
> another way of handling this would be to export this sort of search to a
> support server written in C and using the PCRE library to handle
> matching.

Having a "server" for this seems horribly inefficient and bloated compared
to SA, which already has the REs compiled in memory and scans in-memory
chunks of the body.

Why don't you simply maintain your wordlists in some files and use a script
to generate portmanteau.cf? You could also use the Regexp::Assemble module
to optimize the resulting regex. Who cares what the actual rules look like?
The more words (simple alternations) there are in a single RE, the better it
performs. If you want clarity in the cf, keep the original words listed in a
comment block.
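
A rough sketch of such a generator, in case it helps. It is written in
Python rather than Perl, so it does not use Regexp::Assemble; it just joins
the escaped words into one plain alternation. The function names, the
word-list format (one word per line, "#" for comments) and the \b word
boundaries are my own assumptions, not anything SA requires.

```python
import re


def load_words(lines):
    """Return unique words in first-seen order, skipping blanks and # comments."""
    seen = {}
    for line in lines:
        word = line.strip()
        if word and not word.startswith("#"):
            # De-duplicate case-insensitively, since the rule uses /i.
            seen.setdefault(word.lower(), word)
    return list(seen.values())


def build_rule(name, description, words, score=0.01):
    """Render one SpamAssassin body rule as .cf text (hypothetical helper)."""
    if not words:
        raise ValueError("word list is empty")
    # Escape regex metacharacters, plus the "/" that delimits the rule's regex.
    alternatives = "|".join(re.escape(w).replace("/", r"\/") for w in words)
    return "\n".join([
        # Keep the original words visible in a comment, for readability.
        "# Words: " + " ".join(words),
        f"describe {name} {description}",
        f"body     {name} /\\b(?:{alternatives})\\b/i",
        f"score    {name} {score}",
    ]) + "\n"


if __name__ == "__main__":
    # In practice these lines would come from a word-list file.
    sample = ["# misspellings", "word1", "worrd2", "", "wooord3", "Word1"]
    print(build_rule("PORTMANTEAU", "Misspelt words", load_words(sample)), end="")
```

Redirect the output into portmanteau.cf and run "spamassassin --lint" on the
result before deploying it. With this layout the meta rule and the __PMn
subrules are no longer needed, because each word list becomes a single body
rule.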
