Earlier today I mentioned that I have a number of portmanteau rules that fire on misspelt words in body text, etc. These are all structured along the lines of:

describe PORTMANTEAU Example of a somewhat unwieldy rule
body     __PM1       /(word1|worrd2|wooord3)/i
body     __PM2       /(mispel|misspelll|mspell)/i
....
meta     PORTMANTEAU (__PM1||__PM2||...)
score    PORTMANTEAU 0.01

Yes, they could be just one horrendously long rule, but until SA allows multi-line regexes that would be both unreadable and a nightmare to maintain. As it is, my current approach starts to get unwieldy once a rule uses more than 6-8 subsidiary rules.

So, it occurred to me that another way of handling this would be to export this sort of search to a support server written in C, using the PCRE library to handle matching. The server would load and compile collections of regexes from a configuration file or database at start-up, or when signalled, and would perform all comparisons against in-memory tables. Each distinct regex collection would be named.

A plugin would provide SA's interface to the server, passing it the collection name and the text to be searched and returning a hit/miss result. Rules would use the eval() method to talk to the plugin, telling it which part of the message to test and which regex collection(s) to apply.

The server should save some CPU cycles since, although it would need to work sequentially through a regex collection, it can stop as soon as it has a hit - something I think may not be possible with my portmanteau rule design.

The server should be fairly easy to write since its logic is simple and, even if single-threaded, it could still be a shared resource used by multiple SA copies. It should be fairly small, since most regex collections are unlikely to exceed 1K in text form: its major memory use would come from buffering large message bodies.

So, is this idea worth pursuing?

- is this something anybody apart from myself would use?
- apart from that, is it a stupid idea and, if so, what's wrong with it?
- are there obvious performance problems I haven't spotted:
  - am I right about all regexes in a portmanteau rule being applied to every message?
  - is the overhead of sending the body to the server a show-stopper?

Martin
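
P.S. To make the "stop as soon as it has a hit" idea concrete, here is a rough sketch of the matching core such a server might have. It is only an illustration under some assumptions: it uses the POSIX <regex.h> calls (regcomp/regexec) as a stand-in for PCRE so that it builds without extra libraries, the names collection_t, collection_load, collection_match and collection_free are hypothetical, and the fixed MAX_PATTERNS limit is arbitrary. Loading from a config file, the socket protocol and the SA plugin are all left out.

```c
/* Sketch only: POSIX <regex.h> stands in for PCRE; all names are hypothetical. */
#include <assert.h>
#include <regex.h>
#include <stddef.h>

#define MAX_PATTERNS 16   /* arbitrary limit chosen for this sketch */

/* One named regex collection, compiled once and then kept in memory. */
typedef struct {
    const char *name;
    regex_t     compiled[MAX_PATTERNS];
    size_t      count;    /* number of successfully compiled patterns */
} collection_t;

/* Release every compiled pattern held by the collection. */
static void collection_free(collection_t *c)
{
    assert(c != NULL);
    for (size_t i = 0; i < c->count; i++)
        regfree(&c->compiled[i]);
    c->count = 0;
}

/* Compile n patterns case-insensitively (like the /i flag in the rules).
 * Returns 0 on success, -1 if there are too many patterns or one fails
 * to compile; on failure nothing is left allocated. */
static int collection_load(collection_t *c, const char *name,
                           const char *const *patterns, size_t n)
{
    assert(c != NULL && patterns != NULL);
    c->name  = name;
    c->count = 0;
    if (n > MAX_PATTERNS)
        return -1;
    for (size_t i = 0; i < n; i++) {
        if (regcomp(&c->compiled[i], patterns[i],
                    REG_EXTENDED | REG_ICASE | REG_NOSUB) != 0) {
            collection_free(c);
            return -1;
        }
        c->count++;
    }
    return 0;
}

/* Try each pattern in order and stop at the first hit.
 * Returns the index of the matching pattern, or -1 for a miss. */
static int collection_match(const collection_t *c, const char *text)
{
    assert(c != NULL && text != NULL);
    for (size_t i = 0; i < c->count; i++) {
        if (regexec(&c->compiled[i], text, 0, NULL, 0) == 0)
            return (int)i;   /* early exit: later patterns are never run */
    }
    return -1;
}
```

A caller would load the collection once with collection_load(), call collection_match() for each chunk of text the plugin sends, and treat any result other than -1 as a hit. Returning the index rather than a bare yes/no is just a convenience for logging which pattern fired.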