On Thu, Mar 11, 2010 at 08:11:37PM +0000, Martin Gregorie wrote:
> Earlier today I mentioned that I have a number of portmanteau rules that
> fire on misspelt words in body text, etc. These are all structured along
> the lines of:
>
> describe PORTMANTEAU Example of a somewhat unwieldy rule
> body __PM1 /(word1|worrd2|wooord3|)/i
> body __PM2 /(mispel|misspelll|mspell|)/i
> ....
> meta PORTMANTEAU (__PM1||__PM2||...)
> score PORTMANTEAU 0.01
>
> Yes, they could be just one horrendously long rule, but until SA allows
> multi-line regexes that would be both unreadable and a nightmare to
> maintain. As it is, my current approach is starting to get unwieldy once
> it uses more than 6-8 subsidiary rules. So, it occurred to me that
> another way of handling this would be to export this sort of search to a
> support server written in C and using the PCRE library to handle
> matching.

Having a "server" for this seems horribly inefficient and bloated compared to SA, which already has the REs compiled in memory and scans in-memory chunks of the body.

Why don't you simply maintain your wordlists in some files and use a script to generate portmanteau.cf? You could also use the Regexp::Assemble module to optimize the result.

Who cares what the actual rules look like? The more words (simple alternations) there are in a single RE, the better it performs. If you want clarity in the cf, keep the original words listed in a comment block.
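
A minimal sketch of such a generator, in Python rather than Perl, so it does not use Regexp::Assemble and only joins the words into one plain alternation. The function name build_rule and the layout of the generated file are assumptions made for illustration, not anything SA prescribes. The original words are kept in a comment block above the rule, as suggested.

```python
import re


def build_rule(name, words, description, score=0.01):
    """Return SpamAssassin .cf text for one body rule matching any of `words`.

    Hypothetical helper: the name and output layout are illustrative only.
    """
    # De-duplicate, lowercase and sort so the generated file is stable
    # between runs and diffs cleanly.
    unique = sorted({w.strip().lower() for w in words if w.strip()})
    if not unique:
        raise ValueError("word list is empty")

    # Escape regex metacharacters, plus '/' because it delimits the pattern.
    alternation = "|".join(re.escape(w).replace("/", r"\/") for w in unique)

    # Comment block listing the source words, for readability of the .cf.
    lines = ["# Generated file: edit the word list, not this rule."]
    lines += [f"#   {w}" for w in unique]
    lines += [
        f"describe {name} {description}",
        f"body     {name} /(?:{alternation})/i",
        f"score    {name} {score}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # In practice the words would be read from a wordlist file, one per line.
    print(build_rule("PORTMANTEAU", ["word1", "worrd2", "wooord3"],
                     "Misspelt words in body text"), end="")
```

Redirecting the output to portmanteau.cf and running "spamassassin --lint" afterwards would catch a malformed rule before it goes live.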