on Mon, Sep 13, 2004 at 09:39:15AM -0700, Rod Roark ([EMAIL PROTECTED]) wrote:
> I ran across this during my morning reading:
>
>   http://projects.puremagic.com/greylisting/whitepaper.html
>
> of which there seems to be a good Postfix implementation
> (Postfix is my MTA of choice):
>
>   http://isg.ee.ethz.ch/tools/postgrey/

OK, I gave that a 30 second scan.  It fits in with a few of my own
activities, of which spam profiling is one.

> So I'm seriously considering putting this on my server.
>
> The effect on LUGOD would be:
>
> (1) Virtually no spam.

Good luck.  You _may_ significantly drop the spam load.  Killing it
outright is unlikely.  More below.

> Mostly this is of interest to the
> officers, as the mailing lists already require
> registration in order to post; however spammers might
> easily forge the FROM header to abuse this.

Note that the greylisting is based on a tuple of which at least one
element (immediate upstream IP) is difficult or impossible to reliably
forge.

> (2) Mail from first-time posters, or from those who post
> less frequently than once per month, would likely be
> delayed by an hour or so.

Possibly.

> (3) This *might* allow me to eliminate the current blocking
> of mail from dynamic IPs.

...iff (sic) the IP isn't a candidate for blocking under other criteria.

> Comments?

Sure.

First:  for a given receiving MTA, the _vast_ bulk of legit mail will
appear to come from a handful of IPs, or failing that, netblocks.

Just for kicks, I happen to have some 857+ mails in my lugod vox-tech
folder.  Let's get their upstream IPs, that is:  the IP from which
LUGoD's mailserver received the mail.  This may not be the _ultimate_
origin, but it is the one _assured_ point of transit, and certainly has
no business, say, spewing forth spam.

OK, I'm in my vox-tech/new Maildir directory:

    # For each message, pull the Received: headers, keep the lines added
    # by www.livepenguin.com (minus the internal ns1 hop), and strip each
    # one down to the bare sending IP.
    for f in *
    do
        formail -cX "Received:" < "$f" |
            grep -m2 'by www.livepenguin.com' |
            grep -v 'ns1\.livepenguin\.com'
    done |
        sed -e 's/by www\.live.*//' -e 's/^.*\[//' -e 's/[])]//g' |
        tee /tmp/lugod-ips

    wc -l /tmp/lugod-ips
    sort -u < /tmp/lugod-ips | wc -l

Gives 857 mails from 141 IPs.  OK, that's a big handful....

Let's run these through the reverse-DNS service at asn.routeviews.org
which lets us determine the ASN and CIDR associated with each IP.
'reverse_ip' is a bash shell function in my SpamTools kit which reverses
the quads of an IP for rDNS queries:

    # Look up the TXT record (ASN, network, prefix length) for each IP.
    # The result is saved to /tmp/lugod-cidrs, which the commands below read.
    for ip in $( cat /tmp/lugod-ips )
    do
        host -W 6 -R 10 -t txt "$( reverse_ip "$ip" ).asn.routeviews.org"
    done |
        sed -e 's/^.*text //' -e 's/"//g' |
        tee /tmp/lugod-cidrs

We're now down to a total of 45 ASNs, of which 41 appear more than once:

    $ awk '{print $1}' /tmp/lugod-cidrs | sort | uniq -c | sort -nr | cat -n
     1    160 7065
     2    108 7132
     3     82 5731
     4     77 7961         [about half of all messages]
     5     61 7018
     6     60 22489
     7     48 6192
     8     41 4294967295   [unresolved]
     9     41 26085
    10     34 1698
    11     34 10787
    12     21 4265
    13     20 15169
    14     17 701
    15     15 11403
    16     14 26101
    17     12 4355
    18     11 6540
    19     10 14779
    20      9 6939
    21      9 21566
    22      9 2152
    23      9 17175
    24      7 7407
    25      7 23310
    26      7 11022
    27      6 6478
    28      6 29863
    29      6 21844
    30      6 174
    31      6 14051
    32      6 12076
    33      4 25646
    34      4 1742
    35      3 6785
    36      3 4151
    37      3 3561
    38      2 6517
    39      2 26283
    40      2 22799
    41      2 12181
    42      1 226
    43      1 209
    44      1 15687
    45      1 14829

...which is getting to the neighborhood of what I'd consider to be "a
handful".  *Half* of all mail comes from four ASNs.

The "4294967295" value, BTW, is what routeviews.org returns for an
unknown IP -- the data aren't perfect.

We can also get CIDR from the string (it's the second and third columns
in my output file).
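
The reverse_ip helper used in the lookup loop above isn't shown in this
post.  A minimal sketch of what such a function might look like, assuming
plain dotted-quad IPv4 input (this is an illustrative stand-in, not the
SpamTools version):

```shell
# Hypothetical stand-in for reverse_ip: print the four octets of a
# dotted-quad IPv4 address in reverse order, as rDNS-style zones expect.
reverse_ip() {
    printf '%s\n' "$1" | awk -F. '{ print $4 "." $3 "." $2 "." $1 }'
}

# Example with a documentation address (192.0.2.0/24 is reserved for examples).
reverse_ip 192.0.2.10
```

The reversed string is what gets prepended to .asn.routeviews.org in the
host query.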

Turns out the spread isn't too much more -- 64 CIDRs, of which 54 appear
more than once:

    $ awk '{printf( "%s/%s\n", $2, $3)}' /tmp/lugod-cidrs | sort | uniq -c | sort -nr | cat -n
     1    142 64.142.0.0/19
     2     77 198.144.192.0/19
     3     74 168.150.0.0/16
     4     61 204.127.128.0/17
     5     48 169.237.0.0/16
     6     41 66.163.160.0/19    [> half of all mail]
     7     41 0/0
     8     34 216.57.64.0/20
     9     34 207.115.32.0/19
    10     33 204.127.200.0/21
    11     30 69.55.224.0/20
    12     28 204.127.192.0/21
    13     21 216.148.224.0/22
    14     21 216.148.224.0/19
    15     17 207.247.0.0/16
    16     16 63.192.0.0/12
    17     15 69.55.238.0/24
    18     15 69.55.237.0/24
    19     15 66.111.0.0/20
    20     15 208.201.224.0/19
    21     14 66.218.64.0/19
    22     11 209.210.251.0/24
    23     11 207.217.0.0/16
    24     10 64.233.170.0/24
    25     10 64.233.160.0/19
    26     10 206.190.32.0/20
    27      9 212.165.128.0/17
    28      9 208.184.190.0/23
    29      9 130.86.0.0/16
    30      8 158.222.0.0/16
    31      7 63.101.96.0/21
    32      7 209.239.32.0/19
    33      7 209.232.0.0/15
    34      7 199.233.217.0/24
    35      6 69.56.128.0/17
    36      6 65.54.224.0/19
    37      6 38.0.0.0/8
    38      6 209.151.64.0/19
    39      4 64.62.128.0/18
    40      4 64.62.128.0/17
    41      4 24.2.32.0/19
    42      4 209.79.220.0/22
    43      4 134.174.0.0/16
    44      3 66.120.0.0/13
    45      3 64.142.64.0/19
    46      3 217.157.0.0/16
    47      3 209.225.0.0/18
    48      3 147.49.0.0/16
    49      2 66.60.128.0/18
    50      2 66.54.152.0/23
    51      2 66.54.128.0/17
    52      2 24.207.0.0/18
    53      2 216.93.192.0/19
    54      2 216.86.192.0/19
    55      1 67.172.160.0/19
    56      1 67.169.224.0/20
    57      1 66.60.130.0/24
    58      1 66.60.129.0/24
    59      1 65.19.128.0/18
    60      1 217.16.96.0/20
    61      1 207.69.200.0/24
    62      1 207.159.64.0/18
    63      1 207.159.120.0/24
    64      1 130.221.0.0/16

The handy things about ASNs and CIDRs are:

  - They aggregate beautifully.  A wide range of IPs clusters into a
    narrow band of CIDRs or ASNs.  So both your spamhaus with a large
    number of IPs trickling out a small number of spams each, and your
    friendly neighborhood ISP with a few hundred white hats scattered
    over a /24 or /18, cluster nicely.

  - The data's a DNS query away.  And the zonefiles are rsyncable.

  - The spam/ham determination can be as local and specific as you
    want.

  - Organizationally, ASNs and CIDRs both map to what's typically a
    single entity with effective control over its network.  How it uses
    that control, and whether for good or for bad, is its business.  But
    the data are readily and immediately available to you.

Where I see the next generation of MTAs headed is keeping track of
sender reputation not on the basis of an individual IP's track record
(the classic DNSBL model), but on the record of blocks of IPs.  If you
think about the implications of IPv6 (effectively limitless address
space), you'll *have* to utilize an aggregating tool to be able to use
reputation-based tools effectively (of course, IPv6 appears to be a ways
off for other reasons as well....).

My own data suggest that the bulk of spam, like the bulk of mail on a
list, originates from a small number of identifiable sources.  One ASN
regularly accounts for between 12% and 18% of my own spam (Kornet's
4766).  The top four ASNs are 25% of my spam, the top 20 or so, 50%.

Which suggests a very cheap mode of cutting into spam volumes markedly
by employing ASNs, CIDRs, or similar IP aggregates (though I'm aware of
no others) in generating reputation data, and effecting firewalling,
probabilistic rejection (you reject traffic from an ASN in direct
proportion to the probability it's spam), rate-limiting, etc.  Backing
off from a black-and-white allow/deny mode gives legit mail a fighting
chance....

Which all sounds well and good.  The question, though, is how much spam
are you getting?

There are two large-volume, well-known lists for which I'm aware of spam
stats being available:  comp.risks and the debian-user mailing list.
Comp.risks declared in 2001 that it had reached the spam crossover:
even with filtering, over 50% of the mail received in the moderator's
inbox was spam.
As of October 2003, with SpamAssassin catching > 1000 spams daily, *90%*
of the remaining volume was spam:

    http://catless.ncl.ac.uk/Risks/22.92.html#subj9.1

Debian-user currently rejects > 95% of all mail based on various rules.

Let's say you've got a list that receives 90% spam, and you introduce
point-of-origin filtering at the 50% cutoff (kill any aggregated network
in the first 50%ile spam contributors list).  Congratulations, you've
just eliminated half your spam with a single 20-element rule, based on
your own experience.  Your list also _still_ receives spam equal to 45%
of its original volume, which is roughly 82% of what still gets through.

It's a matter of both the amount of spam you can cut, and the total
volume of spam you're receiving.

On the other hand, content/context based filtering gets expensive both
CPU and time-wise, particularly if you're making extensive use of DNSBLs
(they're useful data sources, but they're time-intensive).  It takes me
10-20 seconds to determine spam or ham on my own system, on a high-speed
line, via SpamAssassin.  I'm faster doing it manually, but I'm not going
to sit there hour after hour, day in and day out.  So the machine does
it.

My own read:  any network in the top-50% range, or whose net mail
contribution is > 50% spam, has no business delivering legitimate
_packets_, let alone mail, and should be firewalled.  I see this as a
network hygiene issue -- one of the administrators of a network
adequately policing it and ensuring that it doesn't spew crud over other
people's networks.  And if they're not going to make the effort to
prevent this to their own satisfaction and needs, the rest of the Net's
welcome to take whatever measures satisfy _their_ own business needs.

So, making a long post, um, longer:  reputation-based MTAs are a Good
Thing[tm], and disposition of mail at SMTP time is the Right Way To Do
It[tm].  It is not, however, the Total Solution[tm].  You're going to
need content filtering.
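
As a back-of-the-envelope check on the 90%-spam example above, here is a
minimal sketch.  It assumes the only inputs are the spam share of
incoming mail and the share of that spam an origin rule removes, and that
no ham is lost to the rule.  The function name remaining_spam_share is
made up for illustration:

```shell
# remaining_spam_share SPAM_PCT CUT_PCT
#   SPAM_PCT: percent of incoming mail that is spam (e.g. 90)
#   CUT_PCT:  percent of that spam removed by origin filtering (e.g. 50)
# Prints two numbers: spam left as a percent of the ORIGINAL volume, and
# spam left as a percent of the mail that still gets through.
remaining_spam_share() {
    awk -v spam="$1" -v cut="$2" 'BEGIN {
        ham  = 100 - spam                 # ham is assumed untouched
        left = spam * (100 - cut) / 100   # spam surviving the cutoff
        printf "%.1f %.1f\n", left, 100 * left / (left + ham)
    }'
}

remaining_spam_share 90 50
```

For 90 and 50 this prints "45.0 81.8":  45 spams and 10 hams remain out
of every 100 messages originally sent, so spam is still about 82% of
what reaches the list.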
It's a nice big step though, and you _can_ use origin to, say, reserve
your expensive filtering steps for the small number (by volume) of
points-of-origin for which you don't have a good trust basis.

Rod, does that answer your question? ;-)

Peace.

--
Karsten M. Self <[EMAIL PROTECTED]>        http://kmself.home.netcom.com/
 What Part of "Gestalt" don't you understand?
    Erin Joyce:  can't get the story right, won't correct it
    http://z.iwethey.org/forums/render/content/show?contentid=96625
_______________________________________________
vox-tech mailing list
[EMAIL PROTECTED]
http://lists.lugod.org/mailman/listinfo/vox-tech