Re: regular expressions was: Kernel Oops

Steven Champeon Tue, 08 Mar 2011 17:01:45 -0800

on Wed, Mar 09, 2011 at 12:03:27AM +0100, mouss wrote:
> [WARNING: Steven CC'd]


:-)
 
> Le 08/03/2011 21:29, Stan Hoeppner a écrit :
> > That makes me wonder why Enemies List[1] uses complex expressions,
> > each one precisely matching a specific rDNS pattern, given EL
> > matches 65k+ patterns total.

Eh, it varies quite a bit, some of them are complex groups like this:

[0-9]+\-[0-9]+\-[0-9]+\-[0-9]+\.dynamic\.(brasov|craiova|fagaras|resita|sfantugheorghe|victoria|zarnesti)\.rdsnet\.ro

because for whatever reason I can't just use a [0-9a-z\-]+ in place of
the group, or because they just grew over time as I saw more hosts. But
some are relatively simple:

[0-9a-z\-]+\-[0-9]+\.fiberlink\.[a-z]+\.rdsnet\.ro

wherever I can get away with it. You have to be careful with blanket
"alphanumeric token" host parts, because sometimes you're matching a
city or town or state or abbreviation and everything's fine, and then
the ISP starts putting 'mail' or 'static' in that token's position in
a similar hostname and suddenly you're blocking more than residential
dynamic cable modems. :-/

eg [0-9]+\-[0-9]+\-[0-9]+\-[0-9]+\.mail[0-9]+\.fft\.com\.au
   [0-9]+\-[0-9]+\-[0-9]+\-[0-9]+\.mail\.eletti\.com\.br
   [0-9]+\.[0-9]+\.[0-9]+\.[0-9]+\.mail\.sistemairis\.com\.br
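A minimal Python sketch of that failure mode. Both patterns and both hostnames below are made up for illustration (example.net is a placeholder domain, not an entry in the list); the only point is that a blanket token in the city position also accepts 'mail'.

```python
import re

# Hypothetical specific pattern: only known city tokens are accepted.
SPECIFIC = re.compile(
    r"[0-9]+\-[0-9]+\-[0-9]+\-[0-9]+\.(raleigh|durham)\.example\.net"
)

# Hypothetical blanket pattern: any alphanumeric token in that position.
BLANKET = re.compile(
    r"[0-9]+\-[0-9]+\-[0-9]+\-[0-9]+\.[0-9a-z\-]+\.example\.net"
)

# Invented hostnames: a residential dynamic host and a mail server
# that happen to share the same naming scheme.
DYNAMIC_HOST = "10-0-0-1.raleigh.example.net"
MAIL_HOST = "10-0-0-2.mail.example.net"


def hits(pattern, host):
    """Return True if the pattern matches the entire hostname."""
    return pattern.fullmatch(host) is not None


for host in (DYNAMIC_HOST, MAIL_HOST):
    print(host, "specific:", hits(SPECIFIC, host), "blanket:", hits(BLANKET, host))
```

The specific pattern matches only the dynamic host; the blanket pattern matches both, which is exactly the false positive described above.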

I haven't really tried to optimize the regular expressions, because of
the way our library processes them - by walking down a tree from '.'
(so, '.' -> ro -> rdsnet -> all the patterns for rdsnet.ro) - so perf is
acceptable (several hundred thousand matches/sec on decent hardware;
~225K lookups/s on my old Macbook via C program).
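The tree walk can be sketched roughly as follows. This is a Python illustration of the general idea only, not the actual library (which is C); the names Node, add_pattern and lookup are invented for the example, and the test hostname is made up.

```python
import re


class Node:
    """One DNS label in the tree, e.g. 'ro' or 'rdsnet'."""

    def __init__(self):
        self.children = {}  # next label (towards the host) -> Node
        self.patterns = []  # compiled regexes filed under this domain


def add_pattern(root, domain, regex):
    """File a regex under a domain such as 'rdsnet.ro'."""
    node = root
    for label in reversed(domain.split(".")):
        node = node.children.setdefault(label, Node())
    node.patterns.append(re.compile(regex))


def lookup(root, hostname):
    """Walk from the TLD inward and try only the patterns on that path."""
    host = hostname.lower().rstrip(".")
    node = root
    for label in reversed(host.split(".")):
        node = node.children.get(label)
        if node is None:
            return None  # no patterns filed under this branch
        for pattern in node.patterns:
            if pattern.fullmatch(host):
                return pattern.pattern
    return None


root = Node()
add_pattern(root, "rdsnet.ro",
            r"[0-9a-z\-]+\-[0-9]+\.fiberlink\.[a-z]+\.rdsnet\.ro")

print(lookup(root, "abc-12.fiberlink.xyz.rdsnet.ro"))
print(lookup(root, "abc-12.fiberlink.xyz.example.com"))
```

A hostname under a domain with no patterns falls out of the walk after a couple of dictionary lookups, without ever running a regex, which is why the individual expressions need little tuning.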

Oh, and we're long past 65K - last build was 74494 patterns. I keep
forgetting to update the Web site. :-)

> as said above, the goal isn't performance (to improve performance, buy
> better hardware or run multiple instances).

Well, no, the goal is acceptable performance, but also manageable update
mechanisms that allow for rapid correction of FP classifications.

> The goal of Steven is to maximize hit rate while minimizing false
> positives. many of us have created rules to block
> generic/dynamic/silly senders. when doing so, you can start by being
> precise at the risk of doing a lot of work because your rules minimise
> FPs, or going the other side by using expressions that block a lot of
> senders including legitimate ones, that is increasing the FP rate. it
> takes time and efforts to get a good balance, and that's what Steven
> work is about.

Yup. And it took me a few months to really understand that the useful
concept of a 'generic' hostname unfortunately also applied to large
mail farms that we wanted mail from. (Now we track 'outmx' patterns,
too, and they account for around an eighth of all the patterns we have.
Same goes for 'webhost' - we mostly just see phishing scams from most of
them, but when you're analyzing someone's mailflow it helps to be able
to tell them which of their mail is coming from legit or quasi-legit
mail sources.)

I used to have a few hundred "compact" expressions, like this, which were
left-anchored but not fully qualified:

%compact = (
               "duN" => 'du[0-9]+',
              "dynN" => 'dyn[0-9]+',
              "pppN" => 'ppp[0-9]+',
             "N-N-N" => '[0-9]+\-[0-9]+\-[0-9]+',
             "dhcpH" => 'dhcp[0-9a-f]+',
             "dhcpN" => 'dhcp[0-9]+',
             "dialN" => 'dial[0-9]+',
             "duN-N" => 'du[0-9]+\-[0-9]+',
             "dyn-N" => 'dyn\-[0-9]+',
             "portN" => 'port[0-9]+',
             "ppp-N" => 'ppp\-[0-9]+',
            "dhcp-N" => 'dhcp\-[0-9]+',
            "dial-N" => 'dial\-[0-9]+',
            "dialup" => 'dialup',
            "du-N-N" => 'du\-[0-9]+\-[0-9]+',
            "dynN-N" => 'dyn[0-9]+\-[0-9]+',
            "port-N" => 'port\-[0-9]+',

[...]

but frankly the FP rate was so awful I ditched them. And not just
because of silly people like whoever set up Marriott's reservations
transactional servers with names like host184.marriott.com, but they
were one very big reason why I ditched them.
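To see why left-anchored but unqualified patterns misfire, here is a small Python sketch. The "dynN" entry is taken from the list above; the "hostN" entry is an assumption added for illustration (the list is truncated, so it may or may not have contained one), and the dyn42 hostname is invented.

```python
import re

# "dynN" is from the list above; "hostN" is ASSUMED for illustration.
COMPACT = {
    "dynN": r"dyn[0-9]+",
    "hostN": r"host[0-9]+",
}


def compact_hits(hostname):
    """Names of compact patterns that match at the start of the hostname.

    re.match anchors on the left only, so nothing constrains the
    domain the host belongs to.
    """
    return [name for name, rx in COMPACT.items() if re.match(rx, hostname)]


print(compact_hits("dyn42.cable.example.net"))
print(compact_hits("host184.marriott.com"))
```

Because nothing ties the pattern to a domain, a transactional server and a dial-up pool look the same to it.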
 
> >[snip]
> > 
> >> If you must match a very large numbers of patterns, you need an
> >> implementation that transforms N patterns into one deterministic
> >> automaton. This can match 1 pattern in the same time as N patterns.
> >> Once the automaton is built (which takes some time) it is blindingly
> >> fast. An example of such an implementation is flex.
> > 
> > This sounds really interesting.  Do you have a link to info about this
> > flex software?  I'd like to read about it.

Oh, that was what we tried first. Matt Sergeant wrote a perl wrapper
around a hunk of C object code that we generated using re2c. Worked
fine, you feed it regexes, it generates C code, you compile it into
an object and call it from a simple perl DNS server, voila. That was
how I provided the first instance of the Enemieslist via DNSBL, for
a year or so, on a Mac Mini. As far as the code went, it worked great.

Unfortunately, it took almost an hour to compile, and that was back
when I "only" had a few thousand patterns. Oh, and you had to recompile
every time you wanted to change *any* pattern. Oh, and then you had
to scp it over to the DNS server(s), stop the server, swap in the new
object, restart the server, and hope it didn't have any errors. Oh,
and FWIW it only supported around 400 queries/second.

So, given that 99% of the patterns we were looking at were based on
(derived from, whatever) hostnames, with all the hierarchicality that
implies, it made sense to treat the individual patterns as leaves
hanging off a fast walk down the DNS tree, so you don't have to compile
the whole thing every time or anything crazy like that.

So we quickly moved to a fast tree-based regex library, patched it into
rbldnsd, and that's what we've been using for the past three or four
years. Distribution of the patterns is via rsync of a flat file that
rbldnsd knows how to re-read whenever it changes. *Much* nicer model.
And it supports something more on the order of thirty thousand
queries/second, a couple of orders of magnitude better.

 http://enemieslist.com/dnsbl/

Feel free to poke at the code, it's distributed under a modified BSD
license, as that's not where the value is for Enemieslist.

> > [1] Enemies List is not available for Postfix, yet, and the
> > intelligence dataset is not free, although the source code is open.
> > EL is integrated in some commercial AS appliances and commercial
> > mail software. I mention it frequently here because it is the only
> > antispam tool I'm aware of that makes almost exclusive use of
> > regexes to identify likely spam sources, and it uses 10s of
> > thousands of regexes.
> 
> I don't use EL, but I think it is usable with postfix. Steven, can you
> confirm this? (some of the features may be sendmail oriented, but it
> would be easy to "generalize" them).

There's three bits of EL:

 1) a huge pile of regexes, classified by assignment and other types,
    useful for risk assessment on, among other things, SMTP actors; it
    is platform-agnostic, though you can query it via DNSBL (if your
    MTA knows how, or can be made to, do that).

 2) a huge bunch of m4 files that implement a staggering array of now
    mostly stale checks on inbound mail, using the awesome power of
    sendmail's config-file-as-basic-programming-language. It is very
    specific to sendmail, but could presumably be generalized if anyone
    really cared to do so.

 3) a SpamAssassin plugin that makes it possible to score on the result
    of an A record lookup of a HELO or PTR (eventually, we'll add TXT
    support, too, for finer-grained control)

#2 was built to make use of #1, and I suppose we thought we'd turn
it into a Barracuda-style appliance model, but #1 started attracting
attention from companies like that so we decided to focus on #1.

#1 was formerly compiled directly into the sendmail.cf, with all the
sheer horrors associated with same, distributed as a flat file for use
with Postfix via check_client_access and a regexp map, and also for
Exim, through a deny config associated with a similar flat file. Those
were the only MTAs I ever supported with native file formats; then
after we moved the patterns behind the dnsbl wall, it seemed pointless
to worry about others.

#2 is more or less stagnant, because it has more handles than an
apothecary chest, and takes a few hours to install and configure and
most people don't want to waste their lives that way. Naturally, #2 is
available to anyone who wants it, and has a few thousand users on
several systems, but it's not where the focus of our efforts goes. FWIW,
I've been running it here on our production mail servers for going on
eight years. Every now and again I tweak it a little to address some new
quirk but for the most part it just looks up HELO and PTR in
enemieslist.com's dnsbl and blocks 99%+ of our inbound spam (so far this
month, it's blocked 12233 messages and allowed in around 60, not
counting maybe a couple dozen or so 419 scams, which I track
differently). So I keep it running here.

#1 is now available via DNSBL lookup, and once I get a few more mirrors
set up and running we'll probably open it up for non-commercial or
low-volume use, either directly via whatever hackery must be done to
make an MTA check HELO and PTR against a DNSBL, or via #3, for which
there is already a plugin with basic support available.

 http://enemieslist.com/how/use.html
 http://enemieslist.com/how/spamassassin.html
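For an MTA or filter that has no built-in support, checking a HELO or PTR name against a name-based DNSBL usually comes down to one A record lookup. The Python sketch below assumes the common convention of appending the zone to the hostname; the zone name dnsbl.example is a placeholder, and the actual zone and return codes should be taken from the pages linked above.

```python
import socket


def dnsbl_lookup(hostname, zone, resolve=socket.gethostbyname):
    """Look up hostname in a name-based DNSBL zone.

    Returns the A record as a string if the name is listed, or None
    if the lookup fails (for example NXDOMAIN, meaning not listed).
    The query layout (hostname + '.' + zone) is an assumption based on
    the usual convention for name-based lists.
    """
    query = hostname.rstrip(".").lower() + "." + zone
    try:
        return resolve(query)
    except OSError:  # socket.gaierror is a subclass of OSError
        return None


def check_helo_and_ptr(helo, ptr, zone, resolve=socket.gethostbyname):
    """Check both names and return a dict of name -> result."""
    return {name: dnsbl_lookup(name, zone, resolve) for name in (helo, ptr)}
```

The resolve parameter exists so the function can be exercised with a stub instead of live DNS; in production it defaults to the system resolver.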

And yes, #1 is not free or open at this time, but is available for
commercial license (we learned almost twenty years ago that giving
things away for free might make you famous but charging for them will
let people use them while enabling you to continue to maintain and
improve them). And for the moment, we're focused on the big ISPs and
mailbox providers, because they're who has the worst need and who can
pay enough for a license to support the project. Once we land a few more
licenses, we'll look at how to make it more widely available for
everyone else.

HTH,
Steve

-- 
hesketh.com/inc. v: +1(919)834-2552 f: +1(919)834-2553 w: http://hesketh.com/
antispam news and intelligence to help you stop spam: http://enemieslist.com/
