I've been reading a little about existing spam tools, and a program called "Send-Safe" seems to be a popular one. It takes various measures to get around filters and bulk-mail detectors, but the authors are kind enough to explain how in the manual at http://www.send-safe.com/manual/. That sounds like a good basis for a Tokenizer: it could detect things like long blocks of whitespace in the subject line, suspiciously encoded URLs, etc.
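To make that concrete, here is a rough sketch of the kind of checks I have in mind. It is not TarProxy's Tokenizer interface, and the specific checks are my own guesses rather than anything taken from the Send-Safe manual. The function name, the token strings, and the five-character whitespace threshold are all made up for illustration.

```python
import re

# Assumption: a run of 5+ whitespace characters in a subject is "long".
WHITESPACE_RUN = re.compile(r"\s{5,}")
# Captures the authority part (userinfo, host, port) of each http(s) URL.
URL_AUTHORITY = re.compile(r"https?://([^/\s\"'<>]+)", re.IGNORECASE)
PERCENT_ESCAPE = re.compile(r"%[0-9A-Fa-f]{2}")


def obfuscation_tokens(subject, body):
    """Return hypothetical marker tokens for a few obfuscation tricks."""
    tokens = []
    if WHITESPACE_RUN.search(subject):
        tokens.append("subject:long-whitespace")
    for authority in URL_AUTHORITY.findall(body):
        # Percent-escapes in the host part are rarely needed in honest links.
        if PERCENT_ESCAPE.search(authority):
            tokens.append("url:escaped-host")
        # "user@host" URLs can disguise the real destination.
        if "@" in authority:
            tokens.append("url:userinfo")
        # A host written as one big decimal number (a dword-encoded IP).
        host = authority.rsplit("@", 1)[-1].split(":")[0]
        if host.isdigit():
            tokens.append("url:numeric-host")
    return tokens
```

Each marker would simply be fed to the classifier alongside the ordinary word tokens, so the filter can learn for itself how strongly each trick correlates with spam.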
Here's one I've been seeing lately... subject lines and message bodies that look like this:
"S'end you'r Ad's to 3.5 M'illio'n De'sktops E'very'day."
I've done a fair amount of natural language processing work. Let me think about some clever ways to check for this sort of thing... perhaps a dictionary lookup against punctuation-stripped words? Would a lookup via ispell be too much overhead?
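As a first stab at the dictionary idea, something like the sketch below might do. It assumes the word list is loaded once into an in-memory set, so each lookup is a hash check rather than a call out to ispell. The tiny KNOWN_WORDS set and the function name are placeholders of mine; a real version would load a full word list.

```python
import string

# Placeholder dictionary; a real check would load a full word list.
KNOWN_WORDS = frozenset(
    {"send", "your", "ads", "to", "million", "desktops", "everyday"}
)
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def deobfuscated_words(text, known_words=KNOWN_WORDS):
    """Return dictionary words that only appear once punctuation is stripped."""
    hits = []
    for raw in text.split():
        # Ignore ordinary leading/trailing punctuation such as a final period.
        core = raw.strip(string.punctuation).lower()
        stripped = core.translate(_STRIP_PUNCT)
        if stripped == core:
            continue  # no punctuation inside the word, nothing to flag
        if stripped.isalpha() and stripped in known_words:
            hits.append(stripped)
    return hits


sample = "S'end you'r Ad's to 3.5 M'illio'n De'sktops E'very'day."
print(deobfuscated_words(sample))
```

One obvious weakness: legitimate contractions such as "it's" or "we'll" collapse to the real words "its" and "well", so a hit is probably better treated as one more token for the classifier than as a verdict on its own.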
----
: The tarproxy-list mailing list is archived at
: http://www.mail-archive.com/tarproxy-list%40martiansoftware.com/
:
: To unsubscribe from this list, follow the instructions at
: http://www.martiansoftware.com/contact.html
:
: TarProxy's project page can be found at
: http://www.martiansoftware.com/tarproxy
