Wietse Venema put forth on 3/8/2011 10:39 AM: > Stan Hoeppner: >> So, the question is, which form of expression processes the "does not >> match" case faster? The fully qualified expression, or the simple >> expression? Noel mentioned that the fully qualified expressions will >> tend to process faster. Is this true? Is it true for both the >> "matches" and "does not match" case? > > I would expect better performance when patterns only match the text > that needs to be matched.
So this would mean the simpler expressions would be faster? That makes me wonder why Enemies List[1] uses complex expressions, each one precisely matching a specific rDNS pattern, given EL matches 65k+ patterns total. Likewise, the original author of my fqrdns.pcre table also used mostly expressions that exactly match a specific rDNS pattern, although in this case we have only 1600+ expressions so speed isn't as critical. I've not made 1 to 1 equivalent simpler expressions and run timing tests. It would be rather time consuming to copy the current table and simplify the expressions in the copy. I'm wondering now if execution times would show any meaningful difference. I wonder if testing just a small subset, say 100 expressions, would be sufficient to show meaningful execution time differences. > If you must match a very large numbers of patterns, you need an > implementation that transforms N patterns into one deterministic > automaton. This can match 1 pattern in the same time as N patterns. > Once the automaton is built (which takes some time) it is blindingly > fast. An example of such an implementation is flex. This sounds really interesting. Do you have a link to info about this flex software? I'd like to read about it. > Similar optimizations are needed for large CIDR maps. Right now, > Postfix's linear search does 10^8 patterns/s. With this, postscreen > can search the largest ipdeny.com file in 1ms on a modern CPU, > which is sufficient for the moment. To make it fast, the CIDR > entries need to be arranged into a tree that can be traversed in > log(N) time. I recall you and Viktor discussing this a while ago. I don't really understand how an OP (myself) would go about creating a tree of our CIDR tables. Or is this something that the Postfix CIDR code would handle? [1] Enemies List is not available for Postfix, yet, and the intelligence dataset is not free, although the source code is open. EL is integrated in some commercial AS appliances and commercial mail software. I mention it frequently here because it is the only antispam tool I'm aware of that makes almost exclusive use of regexes to identify likely spam sources, and it uses 10s of thousands of regexes. -- Stan