On Tue, 17 May 2005 14:01:09 +0100, Henry Stern <[EMAIL PROTECTED]> writes:
> 3. Faster body scanning. > > Every body rule in SpamAssassin requires a separate pass through the > message. The time complexity of this is O(n). If all of the rules > are combined into one using a trie structure, the rules can be > evaluated in O(log n) time. > However, using the latter method it still requires a (less > expensive) O(n) operation to find which rules have been satisfied. > A valuable area of research that will really help SpamAssassin would > be to develop a method of evaluating all of the regular expressions > in O(log n) time and finding exactly which rules have been satisfied > in O(log n) time. I prototyped an implementation of this. My conclusion was that it wouldn't be effective because too many of the regexp's SA uses can not be represented as DFA's. Also, it wouldn't lead to a huge improvement because perl's regexp engine doesn't dominate the runtime of SA, and, would probably become less signifigant with time. If you're interested in the best algorithms I found, if there are $k$ regexps, $m$ matches and the input is of length $n$. Note that $m$ also counts multiple matches by the same rule at different positions. Worst-case time to match regexps is O(n+m) if you don't care about overlapping matches and O(n^2+m)[1] if you do want them.[2] For all practical purposes, it would take time linear in the input text. If you only care about fixed strings (no regexps), then matching any number in parallel can be done in O(n+m) in the worst case.[3] Scott [1] The n^2 is because a regexp may match up to O(n) text, and you have to run the automata at all O(n) positions. In practice, this shouldn't be any more of an issue than it is with the existing perl regexp matcher. [2] Straightforward modification of the algorithm given in the dragon book (forget the chapter), where you record each match as you discover it instead of just outputting the last-found/longest match. [3] Straightforward modification of Aho-corasick, where you record each match as you discover it instead of just outputting the last-found/longest match.
