[I posted this to the Rules Emporium list. Maybe there will be some
interest here.
Theo, if you have a minute, I'd appreciate your thoughts on the URI
decoding and canonicalization, and how they line up with the 'uri'
processing in SA.

As an aside, although Mail::Box::Manager gives it the old college try, it
gets indigestion on the sorts of mail that the spammers throw at it. For
example,

Premature end of base64 data at /usr/lib/perl5/site_perl/5.8.0/Mail/Message/TransferEnc/Base64.pm line 62.
Premature end of base64 data at /usr/lib/perl5/site_perl/5.8.0/Mail/Message/TransferEnc/Base64.pm line 62.
Premature end of base64 data at /usr/lib/perl5/site_perl/5.8.0/Mail/Message/TransferEnc/Base64.pm line 62.
Premature padding of base64 data at /usr/lib/perl5/site_perl/5.8.0/Mail/Message/TransferEnc/Base64.pm line 62.
WARNING: No decoder defined for transfer encoding 8 bit.

It would probably be better to use the SpamAssassin API and let SA do the
parsing ...]



I posted an earlier version of this tool on the SA Talk list yesterday,
but that note has yet to surface. This new, improved version (attached)
runs multiple threads so that the program doesn't block while waiting to
see if a URL is valid.
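
For anyone who wants the shape of the threading idea without opening the
attachment, here is a minimal Python sketch (this is not the attached Perl
script). The names http_probe, check_urls and max_workers are made up for
illustration, and "valid" is assumed to mean "an HTTP GET succeeds". The
probe is passed in as a parameter so the logic can be tried without
touching the network.

```python
from concurrent.futures import ThreadPoolExecutor
import http.client
import urllib.request


def http_probe(url, timeout=10):
    """Return True if an HTTP GET on url succeeds (assumed meaning of 'valid')."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (OSError, ValueError, http.client.HTTPException):
        # DNS failures, refused connections, HTTP errors, malformed URLs
        return False


def check_urls(urls, probe=http_probe, max_workers=8):
    """Probe all URLs on a pool of threads; return those that passed, in input order."""
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(probe, urls))
    return [u for u, ok in zip(urls, results) if ok]
```

Because the slow part is waiting on remote servers, a handful of threads is
usually enough to keep one dead host from stalling the whole run.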

It can be argued that for spam scanning purposes (i.e., BigEvil) it isn't
important that the URL resolve to a working web page, only that it signals
the presence of spam. I thought the checking for URLs would be more
robust, however, if there was an underlying working web site, and
certainly this is necessary if one wants to check whether the referenced
site is spammy. Note also that the program decodes obfuscated URLs in its
output and converts them into a canonical, human-readable form. This often
is not the way that the URL actually appeared inside the e-mail message,
and may or may not conform to the way that this info appears in SA's 'uri'
test, though I think that in general checks for domains will work as
expected.
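
To make "decode and canonicalize" concrete, here is a rough Python sketch
of one way to do it. It is not a port of the attached script, and the
function name canonicalize is hypothetical. It percent-decodes the host and
path, lowercases the scheme and host, and (my own assumption, not
necessarily what the tool does) drops any user:password@ prefix. It does
not handle other tricks such as decimal IP addresses, and decoding the path
wholesale is a simplification.

```python
from urllib.parse import urlsplit, urlunsplit, unquote


def canonicalize(url):
    """Return (host, canonical_url) for a possibly obfuscated http URL."""
    parts = urlsplit(url.strip())
    # hostname excludes any "user:password@" prefix and any ":port" suffix
    host = unquote(parts.hostname or "").lower()
    netloc = host
    if parts.port is not None:
        netloc = f"{host}:{parts.port}"
    path = unquote(parts.path) or "/"
    canonical = urlunsplit((parts.scheme.lower(), netloc, path, parts.query, ""))
    return host, canonical


host, url = canonicalize("HTTP://www.paypal.com@%65vil.EXAMPLE/%70atch/?x=1")
print(host, url, sep="\t")
```

The two returned values correspond to the two tab-separated fields that the
tool prints.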


Inspired by "Filters that fight back", by Paul Graham
 http://www.paulgraham.com/ffb.html
I found a reference to a short script that scans e-mail for URLs,
and then turns around and automatically fetches the offending page.
Well, I'm not interested in doing that at the moment, but I have
enhanced the script (and fixed a bug) to make it do a decent job of
extracting e-mails from an mbox. At the moment, it de-obfuscates the
embedded URLs, and then attempts to validate them by doing an HTTP GET.
For valid URLs, it prints two tab-separated fields: the host name and
the URL. Here's an example of the output:

qznvtmdwct.cokjz.biz       http://qznvtmdwct.cokjz.biz/patch/?hpsales
dia55.us                   http://[EMAIL PROTECTED]/vp/?dia1900
12hen.info                 http://[EMAIL PROTECTED]/vp/o.html
dia55.us                   http://[EMAIL PROTECTED]/patch/?dia1900
12hen.info                 http://[EMAIL PROTECTED]/vp/?zsxdc
sgkmfecsix.cokjz.biz       http://sgkmfecsix.cokjz.biz/patch/?hpsales
biz.yahoo.com              http://biz.yahoo.com/prnews/040106/hktu006_1.html
www.getinfohere.net        http://www.getinfohere.net/mcp/104/2581/cap112.html
bigcharts.marketwatch.com  http://bigcharts.marketwatch.com/intchart/frames/main.asp?time=2&freq=8&compidx=aaaaa:0&comp=NO_SYMBOL_CHOSEN&ma=0&maval=9&uf=0&lf=1&lf2=65536&lf3=512&type=8&style=360&size=2&sid=0&o_symb=UGHO&startdate=&enddate=&show=true&symb=ugho&draw.x=53&draw.y=5

Note: since the Mail::Box::Manager module reads the entire mail folder
into memory, it is probably a good idea to process mail in batches of,
say, 10000 messages or so. I use a split_mail script (also attached) to do
that. Also, the mail parser in Mail::Box::Manager has problems with
alternate character sets in the headers and various other syntactic
oddities, so it will reject a few of the mail messages with a noisy
complaint but will process the other messages.
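
The batching idea can be sketched in a few lines of Python. This is only an
illustration and not the attached split_mail script; the function name
split_mbox, the out_prefix parameter and the numbered-file naming are all
invented here. It assumes the classic mbox convention that every message
starts at a line beginning with "From ", and it streams the file so the
folder is never held in memory at once.

```python
def split_mbox(path, batch_size=10000, out_prefix=None):
    """Copy an mbox file into numbered files of at most batch_size messages each.

    Returns the list of output file names.
    """
    out_prefix = out_prefix or path
    out_paths, out, count = [], None, 0
    with open(path, "rb") as src:
        for line in src:
            if line.startswith(b"From "):  # assumed message separator
                if count % batch_size == 0:
                    # current batch is full (or this is the first message)
                    if out is not None:
                        out.close()
                    out_paths.append(f"{out_prefix}.{len(out_paths):04d}")
                    out = open(out_paths[-1], "wb")
                count += 1
            if out is not None:  # skip anything before the first separator
                out.write(line)
    if out is not None:
        out.close()
    return out_paths
```

Each output file is itself a valid mbox, so it can be fed to the extractor
one batch at a time.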

In the output above, it might make sense to convert the domains to
"reverse dot" notation (for example, biz.cokjz.qznvtmdwct) in order to
more easily find common host domains there. In an ideal world, all this
stuff would be added to a database so that it can be sorted and retrieved
in various ways.
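
As a quick illustration of the "reverse dot" idea, here is a small Python
sketch. The helper name reverse_dots is hypothetical, and the host names
are taken from the sample output above. Sorting on the reversed form puts
hosts that share a parent domain next to each other. Hosts that are bare IP
addresses would need separate handling.

```python
def reverse_dots(host):
    """Turn 'www.example.com' into 'com.example.www'."""
    labels = host.lower().rstrip(".").split(".")
    return ".".join(reversed(labels))


hosts = ["qznvtmdwct.cokjz.biz", "dia55.us", "sgkmfecsix.cokjz.biz", "12hen.info"]

# Print reversed form and original host, grouped by common parent domain
for h in sorted(hosts, key=reverse_dots):
    print(reverse_dots(h), h, sep="\t")
```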

Attachment: split_mail.dat
Description: Binary data

Attachment: extract_urls.pl
Description: Binary data
