Re: [OT] "Boosting" and other potential research topics

Scott A Crosby Mon, 23 May 2005 17:50:39 -0700

On Tue, 17 May 2005 14:01:09 +0100, Henry Stern <[EMAIL PROTECTED]> writes:


> 3.  Faster body scanning.
>
> Every body rule in SpamAssassin requires a separate pass through the
> message.  The time complexity of this is O(n).  If all of the rules
> are combined into one using a trie structure, the rules can be
> evaluated in O(log n) time.

> However, using the latter method it still requires a (less
> expensive) O(n) operation to find which rules have been satisfied.
> A valuable area of research that will really help SpamAssassin would
> be to develop a method of evaluating all of the regular expressions
> in O(log n) time and finding exactly which rules have been satisfied
> in O(log n) time.

I prototyped an implementation of this. My conclusion was that it
wouldn't be effective because too many of the regexp's SA uses can not
be represented as DFA's. Also, it wouldn't lead to a huge improvement
because perl's regexp engine doesn't dominate the runtime of SA, and,
would probably become less signifigant with time.

If you're interested in the best algorithms I found, if there are $k$
regexps, $m$ matches and the input is of length $n$. Note that $m$
also counts multiple matches by the same rule at different positions.

Worst-case time to match regexps is O(n+m) if you don't care about
overlapping matches and O(n^2+m)[1] if you do want them.[2] For all
practical purposes, it would take time linear in the input text.

If you only care about fixed strings (no regexps), then matching any
number in parallel can be done in O(n+m) in the worst case.[3]


Scott

[1] The n^2 is because a regexp may match up to O(n) text, and you
have to run the automata at all O(n) positions. In practice, this
shouldn't be any more of an issue than it is with the existing perl
regexp matcher. 

[2] Straightforward modification of the algorithm given in the dragon
book (forget the chapter), where you record each match as you discover
it instead of just outputting the last-found/longest match.

[3] Straightforward modification of Aho-corasick, where you record
each match as you discover it instead of just outputting the
last-found/longest match.

Re: [OT] "Boosting" and other potential research topics

Reply via email to