Please forgive the obvious dumb question... How do we get these new things to try them out?
Thanks,
Eric

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of [EMAIL PROTECTED]
Sent: August 6, 2006 11:26 AM
To: [email protected]; [email protected]
Subject: [Spambayes] Several new tokenizing gimmicks checked in

With the current crop of pump & dump spams I decided to break down and actually see if ocrad (http://www.gnu.org/software/ocrad/ocrad.html) would help. It does a miserable job from a readability standpoint at extracting text from an image, but SpamBayes seems to love what it does generate.

This morning I thought, "what the hell", and checked in all the current new tricks I've been working on/with:

* IP address lookup and more extensive tokenization. This is from Matt Cowles. I added persistence beyond the current run. Unfortunately, the dbm persistence is untested (though should probably work okay) while the zodb persistence still has problems (writes the file the first time, but doesn't update it on successive runs). Maybe someone can look at those issues. This seems to work very well for those spams where the only useful clue is a URL, but with a domain name that changes each time. They seem to pretty much all point to the same IP address as far as I can tell. Enabled using the x-lookup_ip and lookup_ip_cache options. Requires installation of PyDNS.

* Note image size. This was my first stab at trying to get some information out of an image. Seems to work pretty well. Enabled using the x-image_size option.

* Note short runs of too-short words. Text spammers (as opposed to image spammers) seem to like to use this technique:

      X j A m N j A d X h M k E z R d I p D u I m A c C o I d A t L j I v S j

  to hide their tokens from spam filters. Enabled using the x-short_runs option. Based on my current database I'm skeptical this will add much over what else we already have.

* Try OCR on images. The latest technique we've all encountered seems to be the pump and dump stock scams where the entire come-on is embedded in one or more GIF images. I wrote a small ImageStripper module which handles these. It grabs the image parts, converts them to netpbm format, concatenates them left-to-right, then submits the result to ocrad. This is just a proof-of-concept. It requires ocrad and netpbm to be available. As such I suspect it will only run currently on Unix-like systems. Enabled using the x-crack_images and max_image_size options.

I added these extensions using multiple checkins, so if we decide to back one or more of them out it shouldn't be a major PITA.

Skip

_______________________________________________
[email protected]
http://mail.python.org/mailman/listinfo/spambayes
Check the FAQ before asking: http://spambayes.sf.net/faq.html
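For readers who want to see what the "short runs of too-short words" idea from Skip's message might look like in practice, here is a minimal Python sketch. It is not the code that was checked in: the function names, the `short-run:N` token format, and the two thresholds (`max_word_len`, `min_run`) are all assumptions made up for illustration.

```python
def short_run_lengths(text, max_word_len=2):
    """Return the length of every run of consecutive too-short words.

    A word counts as "too short" when it has at most max_word_len
    characters (a hypothetical threshold, not a SpamBayes setting).
    """
    runs = []
    current = 0
    for word in text.split():
        if len(word) <= max_word_len:
            current += 1
        else:
            if current:
                runs.append(current)
            current = 0
    if current:
        # The text ended while still inside a run.
        runs.append(current)
    return runs


def short_run_tokens(text, min_run=4):
    """Yield one synthetic token per run of at least min_run short words.

    The "short-run:N" token format is invented for this sketch.
    """
    for length in short_run_lengths(text):
        if length >= min_run:
            yield "short-run:%d" % length


if __name__ == "__main__":
    sample = "Buy now X j A m N j A d cheap"
    print(list(short_run_tokens(sample)))
```

Ordinary prose rarely strings together many one- or two-letter words, so a long run is a plausible spam clue; whether it adds anything over the existing tokens is exactly the doubt raised in the message.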
