Skip,

That sounds great. Thanks. I don't know if I will take all the steps to try and get it up and running or wait for a new release, but we really appreciate it.
I did have a few questions. How much time/processing does the OCR take? I would think that might be very intensive. Not that most people don't have the cycles to spare, or that it wouldn't be much faster than scanning the spam myself, but I'm just curious.

Also, should one re-initialize the spam database? Are all tokens the same, so that once extracted they are just like any other? Or are they somehow grouped to relate to images?

Thanks,
Alan

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Friday, August 18, 2006 6:33 AM
To: Alice Bryson <[EMAIL PROTECTED]>; [email protected]; Alan Arndt
Subject: Analyzing text in image spam (was: Spam in Images)

>>>>> "skip" == skip <[EMAIL PROTECTED]> writes:

    Alice> .... i notice that many html code of image has the same format
    Alice> like <IMG ALT="" border="0"
    Alice> SRC="cid:[email protected]"> what the cid here mean?
    Alice> does it valueable for recognize this spam?

    skip> It just identifies an image that is delivered along with the
    skip> message. By itself it doesn't mean a lot.

I should have given a bit more complete answer based on your message's more general point. I recently added a fair amount of code to SpamBayes to "crack" the content of images. The new code works very well for me. If you'd like to try it, here's what you'll need to do:

1. Check out the latest source from the CVS repository. (There's been no new release since my recent checkins.) Install it.

2. Install the Python Imaging Library:

       http://www.pythonware.com/products/pil/

3a. (Windows) Grab the ocrad-cygwin package from the SpamBayes Files page:

       http://sourceforge.net/project/showfiles.php?group_id=61702

    Unpack the zip file and copy ocrad.exe somewhere on your PATH.

3b. (Unix/Linux/Mac) Grab the ocrad source distribution from its web site:

       http://www.gnu.org/software/ocrad/ocrad.html

    Unpack and install it.

I realize this may not be all that straightforward for people who are unused to installing open source software.
Once you've done it a couple times though, it gets easier. Hopefully, we can get another SpamBayes alpha release out in the next little while. (Tony, if there's anything I can do to help make this happen, let me know.)

Once you're ready to go, add the following to your SpamBayes options:

    x-lookup_ip: True
    lookup_ip_cache: ~/.dnscache

    x-image_size: True

    x-crack_images: True
    crack_image_cache: ~/.image_cache.pickle

The first group is unrelated to the image spam, but I find it helps me a lot. It maps hostnames to their IP addresses using DNS and generates tokens based on those addresses. The second records tokens about the size of images. The third enables text extraction from images (OCR, or optical character recognition). This is where PIL and Ocrad come in.

I still get the occasional false negative on image spam, but it's definitely manageable and should improve as Ocrad (itself still a very alpha piece of software) improves. Even though Ocrad does a poor job of text extraction from a human comprehension standpoint, it generates tokens that SpamBayes just loves and seems to generate enough unique tokens to tip the scales on most image spam.

Skip
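
For anyone curious what "cracking" an image involves, here is a rough sketch of the general idea: convert the image to PNM with PIL, run the ocrad command on it, and turn whatever text comes back into tokens. This is not the code checked into SpamBayes. The function names, the "image-text:" token prefix, and the three-character minimum word length are all made up for illustration, and it assumes an ocrad executable is on your PATH.

```python
import os
import re
import subprocess
import tempfile


def text_to_tokens(text, prefix="image-text:"):
    """Turn raw OCR output into a sorted list of distinct, prefixed tokens.

    The prefix and the 3-character minimum are illustrative assumptions,
    not the token format SpamBayes actually uses.
    """
    words = re.findall(r"[a-z0-9]{3,}", text.lower())
    return sorted({prefix + word for word in words})


def crack_image(path, ocrad="ocrad"):
    """Run Ocrad on an image file and return tokens for the text it finds.

    Assumes PIL is installed and an `ocrad` executable is on PATH.
    """
    from PIL import Image  # imported lazily so text_to_tokens works without PIL

    fd, pnm_path = tempfile.mkstemp(suffix=".pnm")
    os.close(fd)
    try:
        # Ocrad reads PNM input; a grayscale ("L") image is written as PGM.
        Image.open(path).convert("L").save(pnm_path, format="PPM")
        result = subprocess.run(
            [ocrad, pnm_path], capture_output=True, text=True, check=False
        )
    finally:
        os.remove(pnm_path)
    # Ocrad prints the recognized text on stdout, however garbled it may be.
    return text_to_tokens(result.stdout)


if __name__ == "__main__":
    # The tokenizer can be tried on its own, without PIL or Ocrad installed.
    print(text_to_tokens("V1AGRA ch3ap  Buy n0w!! buy"))
```

The point of the sketch is that the OCR output does not have to be readable. Misrecognized words still become consistent tokens, which is all a statistical classifier needs.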
