On Wed, Mar 09, 2011 at 12:03:27AM +0100, mouss wrote:
> [WARNING: Steven CC'd]

:-)

> On 08/03/2011 21:29, Stan Hoeppner wrote:
> > That makes me wonder why Enemies List[1] uses complex expressions,
> > each one precisely matching a specific rDNS pattern, given EL
> > matches 65k+ patterns total.

Eh, it varies quite a bit. Some of them are complex groups like this:

  [0-9]+\-[0-9]+\-[0-9]+\-[0-9]+\.dynamic\.(brasov|craiova|fagaras|resita|sfantugheorghe|victoria|zarnesti)\.rdsnet\.ro

because for whatever reason I can't just use a [0-9a-z\-]+ in place of
the group, or because they just grew over time as I saw more hosts. But
some are relatively simple:

  [0-9a-z\-]+\-[0-9]+\.fiberlink\.[a-z]+\.rdsnet\.ro

wherever I can get away with it. You have to be careful with blanket
"alphanumeric token" host parts, because sometimes you're matching a
city or town or state or abbreviation and everything's fine, and then
the ISP starts putting 'mail' or 'static' in that token's position in a
similar hostname and suddenly you're blocking more than residential
dynamic cable modems. :-/ e.g.

  [0-9]+\-[0-9]+\-[0-9]+\-[0-9]+\.mail[0-9]+\.fft\.com\.au
  [0-9]+\-[0-9]+\-[0-9]+\-[0-9]+\.mail\.eletti\.com\.br
  [0-9]+\.[0-9]+\.[0-9]+\.[0-9]+\.mail\.sistemairis\.com\.br

I haven't really tried to optimize the regular expressions, because our
library processes them by walking down a tree from '.' (so, '.' -> ro ->
rdsnet -> all the patterns for rdsnet.ro), which means any one hostname
is only ever tested against the patterns for its own domain. So perf is
acceptable (several hundred thousand matches/sec on decent hardware;
~225K lookups/s on my old Macbook via C program).

Oh, and we're long past 65K - last build was 74494 patterns. I keep
forgetting to update the Web site. :-)

> as said above, the goal isn't performance (to improve performance, buy
> better hardware or run multiple instances).

Well, no, the goal is acceptable performance, but also manageable update
mechanisms that allow for rapid correction of FP classifications.

> The goal of Steven is to maximize hit rate while minimizing false
> positives.
> many of us have created rules to block
> generic/dynamic/silly senders. when doing so, you can start by being
> precise at the risk of doing a lot of work because your rules minimise
> FPs, or going the other side by using expressions that block a lot of
> senders including legitimate ones, that is increasing the FP rate. it
> takes time and efforts to get a good balance, and that's what Steven's
> work is about.

Yup. And it took me a few months to really understand that the useful
concept of a 'generic' hostname unfortunately also applied to large mail
farms that we wanted mail from. (Now we track 'outmx' patterns, too, and
they account for around an eighth of all the patterns we have. Same goes
for 'webhost' - we mostly just see phishing scams from most of them, but
when you're analyzing someone's mailflow it helps to be able to tell
them which of their mail is coming from legit or quasi-legit mail
sources.)

I used to have a few hundred "compact" expressions, like this, which
were left-anchored but not fully qualified:

%compact = (
    "duN"    => 'du[0-9]+',
    "dynN"   => 'dyn[0-9]+',
    "pppN"   => 'ppp[0-9]+',
    "N-N-N"  => '[0-9]+\-[0-9]+\-[0-9]+',
    "dhcpH"  => 'dhcp[0-9a-f]+',
    "dhcpN"  => 'dhcp[0-9]+',
    "dialN"  => 'dial[0-9]+',
    "duN-N"  => 'du[0-9]+\-[0-9]+',
    "dyn-N"  => 'dyn\-[0-9]+',
    "portN"  => 'port[0-9]+',
    "ppp-N"  => 'ppp\-[0-9]+',
    "dhcp-N" => 'dhcp\-[0-9]+',
    "dial-N" => 'dial\-[0-9]+',
    "dialup" => 'dialup',
    "du-N-N" => 'du\-[0-9]+\-[0-9]+',
    "dynN-N" => 'dyn[0-9]+\-[0-9]+',
    "port-N" => 'port\-[0-9]+',
    [...]

but frankly the FP rate was so awful I ditched them. And not just
because of silly people like whoever set up Marriott's reservations
transactional servers with names like host184.marriott.com, but they
were one very big reason why I ditched them.

> >[snip]
> >
> >> If you must match a very large number of patterns, you need an
> >> implementation that transforms N patterns into one deterministic
> >> automaton. This can match 1 pattern in the same time as N patterns.
> >> Once the automaton is built (which takes some time) it is blindingly
> >> fast. An example of such an implementation is flex.
> >
> > This sounds really interesting. Do you have a link to info about this
> > flex software? I'd like to read about it.

Oh, that was what we tried first. Matt Sergeant wrote a perl wrapper
around a hunk of C object code that we generated using re2c. Worked
fine: you feed it regexes, it generates C code, you compile it into an
object and call it from a simple perl DNS server, voila. That was how I
provided the first instance of the Enemieslist via DNSBL, for a year or
so, on a Mac Mini.

As far as the code went, it worked great. Unfortunately, it took almost
an hour to compile, and that was back when I "only" had a few thousand
patterns. Oh, and you had to recompile every time you wanted to change
*any* pattern. Oh, and then you had to scp it over to the DNS server(s),
stop the server, swap in the new object, restart the server, and hope it
didn't have any errors. Oh, and FWIW it only supported around 400
queries/second.

So, given that 99% of the patterns we were looking at were based on
(derived from, whatever) hostnames, with all the hierarchicality that
implies, it made sense to treat the individual patterns as leaves
hanging off a fast walk down the DNS tree, so you don't have to compile
the whole thing every time or anything crazy like that.

So we quickly moved to a fast tree-based regex library, patched it into
rbldnsd, and that's what we've been using for the past three or four
years. Distribution of the patterns is via rsync of a flat file that
rbldnsd knows how to re-read whenever it changes. *Much* nicer model.
And it supports something more on the order of thirty thousand
queries/second, nearly two orders of magnitude better.

http://enemieslist.com/dnsbl/

Feel free to poke at the code, it's distributed under a modified BSD
license, as that's not where the value is for Enemieslist.

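For anyone who wants to see the shape of that tree walk, here is a minimal sketch in Python. It is not the Enemieslist library or the rbldnsd patch, just an illustration of the idea. The class name PatternTree, its methods, and the sample hostnames are made up; only the two rdsnet.ro patterns come from this message, with the domain suffix split off so it can serve as the tree key.

```python
import re


class PatternTree:
    """Hypothetical sketch: regexes filed under the domain suffix they end in."""

    def __init__(self):
        self.root = {"children": {}, "patterns": []}

    def add(self, suffix, host_regex):
        """File host_regex under suffix; it matches only the labels left of it."""
        node = self.root
        # Build the path right to left: "rdsnet.ro" becomes ro -> rdsnet.
        for label in reversed(suffix.lower().split(".")):
            node = node["children"].setdefault(
                label, {"children": {}, "patterns": []}
            )
        node["patterns"].append(re.compile(host_regex))

    def match(self, hostname):
        """Return the first matching pattern string, or None."""
        labels = hostname.lower().rstrip(".").split(".")
        node = self.root
        cut = len(labels)
        # Walk down from the TLD for as long as the tree has a matching child.
        while cut > 0 and labels[cut - 1] in node["children"]:
            node = node["children"][labels[cut - 1]]
            cut -= 1
            rest = ".".join(labels[:cut])
            # Only the patterns stored at this node are ever tried.
            for rx in node["patterns"]:
                if rx.fullmatch(rest):
                    return rx.pattern
        return None


tree = PatternTree()
tree.add(
    "rdsnet.ro",
    r"[0-9]+\-[0-9]+\-[0-9]+\-[0-9]+\.dynamic\."
    r"(brasov|craiova|fagaras|resita|sfantugheorghe|victoria|zarnesti)",
)
tree.add("rdsnet.ro", r"[0-9a-z\-]+\-[0-9]+\.fiberlink\.[a-z]+")

# Sample hostnames are invented for the demo.
for host in (
    "10-20-30-40.dynamic.brasov.rdsnet.ro",
    "mail.rdsnet.ro",
    "host184.marriott.com",
):
    print(host, "->", tree.match(host))
```

In a layout like this, changing a pattern only touches the list at one node, which is roughly why a flat-file reload can stand in for an hour-long recompile.
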
> > [1] Enemies List is not available for Postfix, yet, and the
> > intelligence dataset is not free, although the source code is open.
> > EL is integrated in some commercial AS appliances and commercial
> > mail software. I mention it frequently here because it is the only
> > antispam tool I'm aware of that makes almost exclusive use of
> > regexes to identify likely spam sources, and it uses 10s of
> > thousands of regexes.
>
> I don't use EL, but I think it is usable with postfix. Steven, can you
> confirm this? (some of the features may be sendmail oriented, but it
> would be easy to "generalize" them).

There are three bits of EL:

1) a huge pile of regexes, classified by assignment and other types,
   useful for risk assessment on, among other things, SMTP actors; it is
   platform-agnostic, though you can query it via DNSBL (if your MTA
   knows how, or can be made to, do that).

2) a huge bunch of m4 files that implement a staggering array of now
   mostly stale checks on inbound mail, using the awesome power of
   sendmail's config-file-as-basic-programming-language. It is very
   specific to sendmail, but could presumably be generalized if anyone
   really cared to do so.

3) a SpamAssassin plugin that makes it possible to score on the result
   of an A record lookup of a HELO or PTR (eventually, we'll add TXT
   support, too, for finer-grained control)

#2 was built to make use of #1, and I suppose we thought we'd turn it
into a Barracuda-style appliance model, but #1 started attracting
attention from companies like that so we decided to focus.

#1 was formerly compiled directly into the sendmail.cf (with all the
sheer horrors associated with same), and was also distributed as a flat
file for use with Postfix via check_client_access and a regexp map, and
for Exim, through a deny config associated with a similar flat file.
Those were the only MTAs I ever supported with native file formats; then
after we moved the patterns behind the dnsbl wall, it seemed pointless
to worry about others.

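For anyone wiring #1 into a filter by hand, a name-based DNSBL check of a HELO or PTR might look roughly like the Python sketch below. It assumes the usual convention for name-based lists: append the list's zone to the hostname, look up an A record, and treat an answer as a hit. This message doesn't spell out Enemieslist's own query format or return codes, so the zone dnsbl.example, the function names, and the 127.0.0.2 answer are all placeholders.

```python
import socket

# Placeholder zone: the real zone name is not given in this message.
ZONE = "dnsbl.example"


def dnsbl_query_name(hostname, zone=ZONE):
    """Build the name to look up: the hostname with the list's zone appended."""
    return hostname.rstrip(".").lower() + "." + zone.strip(".")


def dnsbl_lookup(hostname, zone=ZONE, resolve=socket.gethostbyname):
    """Return the A record answer for hostname in the zone, or None."""
    try:
        return resolve(dnsbl_query_name(hostname, zone))
    except OSError:
        # Any lookup failure (including "no such name") is treated as
        # "not listed" in this sketch; real code may want to tell a
        # timeout apart from a miss.
        return None


def stub_resolver(name):
    """Offline stand-in for DNS so the example runs without a network."""
    listed = {"1-2-3-4.dynamic.example.net.dnsbl.example": "127.0.0.2"}
    if name in listed:
        return listed[name]
    raise socket.gaierror("name not found")


# Hostnames are invented; swap in a real resolver by dropping resolve=.
for name in ("1-2-3-4.dynamic.example.net", "mail.example.org"):
    print(name, "->", dnsbl_lookup(name, resolve=stub_resolver))
```

The resolver is passed in as a parameter only so the sketch can be tried offline; the default uses the system resolver via socket.gethostbyname.
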
#2 is more or less stagnant, because it has more handles than an
apothecary chest, and takes a few hours to install and configure and
most people don't want to waste their lives that way. Naturally, #2 is
available to anyone who wants it, and has a few thousand users on
several systems, but it's not where the focus of our efforts goes.

FWIW, I've been running it here on our production mail servers for going
on eight years. Every now and again I tweak it a little to address some
new quirk but for the most part it just looks up HELO and PTR in
enemieslist.com's dnsbl and blocks 99%+ of our inbound spam (so far this
month, it's blocked 12233 messages and allowed in around 60, not
counting maybe a couple dozen or so 419 scams, which I track
differently). So I keep it running here.

#1 is now available via DNSBL lookup, and once I get a few more mirrors
set up and running we'll probably open it up for non-commercial or
low-volume use, either directly via whatever hackery must be done to
make an MTA check HELO and PTR against a DNSBL, or via #3, for which
there is already a plugin with basic support available.

http://enemieslist.com/how/use.html
http://enemieslist.com/how/spamassassin.html

And yes, #1 is not free or open at this time, but is available for
commercial license (we learned almost twenty years ago that giving
things away for free might make you famous but charging for them will
let people use them while enabling you to continue to maintain and
improve them). And for the moment, we're focused on the big ISPs and
mailbox providers, because they're who has the worst need and who can
pay enough for a license to support the project. Once we land a few more
licenses, we'll look at how to make it more widely available for
everyone else.

HTH,
Steve

-- 
hesketh.com/inc. v: +1(919)834-2552 f: +1(919)834-2553 w: http://hesketh.com/
antispam news and intelligence to help you stop spam: http://enemieslist.com/