[I posted this to the Rules Emporium list. Maybe there will be some interest here. Theo, if you have a minute, I'd appreciate your thoughts on the URI decoding and canonicalization, and on how they line up with the 'uri' processing in SA.
As an aside, although Mail::Box::Manager gives it the old college try, it gets indigestion on the sorts of mail that the spammers throw at it. For example:

  Premature end of base64 data at /usr/lib/perl5/site_perl/5.8.0/Mail/Message/TransferEnc/Base64.pm line 62.
  Premature end of base64 data at /usr/lib/perl5/site_perl/5.8.0/Mail/Message/TransferEnc/Base64.pm line 62.
  Premature end of base64 data at /usr/lib/perl5/site_perl/5.8.0/Mail/Message/TransferEnc/Base64.pm line 62.
  Premature padding of base64 data at /usr/lib/perl5/site_perl/5.8.0/Mail/Message/TransferEnc/Base64.pm line 62.
  WARNING: No decoder defined for transfer encoding 8 bit.

It would probably be better to use the SpamAssassin API and let SA do the parsing ...]

I posted an earlier version of this tool on the SA Talk list yesterday, but that note has yet to surface. This new, improved version (attached) runs multiple threads so that the program doesn't block while waiting to see whether a URL is valid.

It can be argued that for spam-scanning purposes (i.e., BigEvil) it isn't important that the URL resolve to a working web page, only that it signals the presence of spam. I thought the URL checking would be more robust, however, if there were a working web site behind each URL, and this is certainly necessary if one wants to check whether the referenced site is spammy.

Note also that the program's output decodes obfuscated URLs and converts them into a canonical, human-readable form. This is often not the way the URL actually appeared inside the e-mail message, and it may or may not match the way this information appears in SA's 'uri' test, though I think that in general checks for domains will work as expected.

Inspired by "Filters that fight back", by Paul Graham (http://www.paulgraham.com/ffb.html), I found a reference to a short script that scans e-mail for URLs and then turns around and automatically references the offending page.
Well, I'm not interested in doing that at the moment, but I have enhanced the script (and fixed a bug) so that it does a decent job of extracting URLs from the e-mail in an mbox. At the moment, it de-obfuscates the embedded URLs and then attempts to validate them by doing an HTTP 'get'. For valid URLs, it prints two tab-separated fields: the host name and the URL. Here's an example of the output:

  qznvtmdwct.cokjz.biz	http://qznvtmdwct.cokjz.biz/patch/?hpsales
  dia55.us	http://[EMAIL PROTECTED]/vp/?dia1900
  12hen.info	http://[EMAIL PROTECTED]/vp/o.html
  dia55.us	http://[EMAIL PROTECTED]/patch/?dia1900
  12hen.info	http://[EMAIL PROTECTED]/vp/?zsxdc
  sgkmfecsix.cokjz.biz	http://sgkmfecsix.cokjz.biz/patch/?hpsales
  biz.yahoo.com	http://biz.yahoo.com/prnews/040106/hktu006_1.html
  www.getinfohere.net	http://www.getinfohere.net/mcp/104/2581/cap112.html
  bigcharts.marketwatch.com	http://bigcharts.marketwatch.com/intchart/frames/main.asp?time=2&freq=8&compidx=aaaaa:0&comp=NO_SYMBOL_CHOSEN&ma=0&maval=9&uf=0&lf=1&lf2=65536&lf3=512&type=8&style=360&size=2&sid=0&o_symb=UGHO&startdate=&enddate=&show=true&symb=ugho&draw.x=53&draw.y=5

Note: since the Mail::Box::Manager module reads the entire mail folder into memory, it is probably a good idea to process mail in batches of, say, 10000 messages or so. I use a split_mail script (also attached) to do that. Also, the mail parser in Mail::Box::Manager has problems with alternate character sets in the headers and various other syntactic oddities, so it will reject a few of the messages with a noisy complaint but will still process the rest.

In the output above, it might make sense to convert the domains to "reverse dot" notation in order to find common host domains more easily. In an ideal world, all this stuff would be added to a database so that it can be sorted and retrieved in various ways.
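To make the "reverse dot" idea concrete, here is a minimal sketch in Python. It is not part of the attached Perl scripts; the function names reverse_dot and sort_by_reversed_host are invented for this illustration, and it assumes input lines in the two-field host<TAB>URL format described above.

```python
def reverse_dot(host):
    """Turn 'a.b.example.biz' into 'biz.example.b.a' (hypothetical helper)."""
    return ".".join(reversed(host.lower().strip(".").split(".")))


def sort_by_reversed_host(lines):
    """Parse 'host<TAB>url' lines and sort them by reversed host name.

    Lines without a tab are skipped, on the assumption that they are
    not records produced by the extraction tool.
    """
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if "\t" not in line:
            continue
        host, url = line.split("\t", 1)
        rows.append((reverse_dot(host), url))
    return sorted(rows)


# Three records taken from the sample output above.
sample = [
    "qznvtmdwct.cokjz.biz\thttp://qznvtmdwct.cokjz.biz/patch/?hpsales",
    "biz.yahoo.com\thttp://biz.yahoo.com/prnews/040106/hktu006_1.html",
    "sgkmfecsix.cokjz.biz\thttp://sgkmfecsix.cokjz.biz/patch/?hpsales",
]

for key, url in sort_by_reversed_host(sample):
    print(key + "\t" + url)
```

With the keys reversed, a plain sort puts the two cokjz.biz hosts next to each other, so hosts that share a parent domain end up grouped without any extra bookkeeping.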
Attachments: split_mail.dat, extract_urls.pl
