Re: regular expressions was: Kernel Oops

Stan Hoeppner Tue, 08 Mar 2011 12:29:54 -0800

Wietse Venema put forth on 3/8/2011 10:39 AM:
> Stan Hoeppner:
>> So, the question is, which form of expression processes the "does not
>> match" case faster?  The fully qualified expression, or the simple
>> expression?  Noel mentioned that the fully qualified expressions will
>> tend to process faster.  Is this true?  Is it true for both the
>> "matches" and "does not match" case?
> 
> I would expect better performance when patterns only match the text
> that needs to be matched.


So this would mean the simpler expressions would be faster?  That makes
me wonder why Enemies List[1] uses complex expressions, each one
precisely matching a specific rDNS pattern, given EL matches 65k+
patterns total.  Likewise, the original author of my fqrdns.pcre table
also used mostly expressions that exactly match a specific rDNS pattern,
although in this case we have only 1600+ expressions so speed isn't as
critical.

I've not made 1 to 1 equivalent simpler expressions and run timing
tests.  It would be rather time consuming to copy the current table and
simplify the expressions in the copy.  I'm wondering now if execution
times would show any meaningful difference.  I wonder if testing just a
small subset, say 100 expressions, would be sufficient to show
meaningful execution time differences.

> If you must match a very large numbers of patterns, you need an
> implementation that transforms N patterns into one deterministic
> automaton. This can match 1 pattern in the same time as N patterns.
> Once the automaton is built (which takes some time) it is blindingly
> fast. An example of such an implementation is flex.

This sounds really interesting.  Do you have a link to info about this
flex software?  I'd like to read about it.

> Similar optimizations are needed for large CIDR maps. Right now,
> Postfix's linear search does 10^8 patterns/s. With this, postscreen
> can search the largest ipdeny.com file in 1ms on a modern CPU,
> which is sufficient for the moment. To make it fast, the CIDR
> entries need to be arranged into a tree that can be traversed in
> log(N) time.

I recall you and Viktor discussing this a while ago.  I don't really
understand how an OP (myself) would go about creating a tree of our CIDR
tables.  Or is this something that the Postfix CIDR code would handle?


[1] Enemies List is not available for Postfix, yet, and the intelligence
dataset is not free, although the source code is open.  EL is integrated
in some commercial AS appliances and commercial mail software.  I
mention it frequently here because it is the only antispam tool I'm
aware of that makes almost exclusive use of regexes to identify likely
spam sources, and it uses 10s of thousands of regexes.

-- 
Stan

Re: regular expressions was: Kernel Oops

Reply via email to