[Spambayes] Analyzing text in image spam (was: Spam in Images)

skip Fri, 18 Aug 2006 06:35:41 -0700

>>>>> "skip" == skip  <[EMAIL PROTECTED]> writes:


    Alice> .... I notice that the HTML code for many images has the same
    Alice> format, like <IMG ALT="" border="0"
    Alice> SRC="cid:[email protected]">. What does the cid here mean?
    Alice> Is it valuable for recognizing this spam?

    skip> It just identifies an image that is delivered along with the
    skip> message.  By itself it doesn't mean a lot.
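
To make the cid mechanism concrete: a cid: URL in the HTML refers to
another MIME part of the same message whose Content-ID header carries the
same identifier in angle brackets.  The following is a minimal sketch using
Python's standard email package.  The helper name find_cid_part and the id
img001@example.invalid are made up for illustration.

```python
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def find_cid_part(msg, cid):
    """Return the MIME part that a cid: reference points at, or None."""
    wanted = "<%s>" % cid  # Content-ID values are wrapped in angle brackets
    for part in msg.walk():
        if part.get("Content-ID", "").strip() == wanted:
            return part
    return None


# Build a tiny multipart/related message.  The id is hypothetical.
msg = MIMEMultipart("related")
html = '<IMG ALT="" border="0" SRC="cid:img001@example.invalid">'
msg.attach(MIMEText(html, "html"))

image = MIMEImage(b"GIF89a placeholder bytes", _subtype="gif")
image.add_header("Content-ID", "<img001@example.invalid>")
msg.attach(image)

part = find_cid_part(msg, "img001@example.invalid")
print(part.get_content_type())
```

The cid only links the HTML to an attached image, so it carries little
signal on its own.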

I should have given a more complete answer that addressed your message's
more general point.  I recently added a fair amount of code to SpamBayes to
"crack" the content of images.  The new code works very well for me.  If
you'd like to try it, here's what you'll need to do:

    1. Check out the latest source from the CVS repository.  (There's been
       no new release since my recent checkins.)  Install it.

    2. Install the Python Imaging Library:
           http://www.pythonware.com/products/pil/

    3a. (Windows) Grab the ocrad-cygwin package from the SpamBayes Files
        page:
            http://sourceforge.net/project/showfiles.php?group_id=61702
        Unpack the zip file and copy ocrad.exe somewhere on your PATH.

    3b. (Unix/Linux/Mac) Grab the ocrad source distribution from its web
        site:
            http://www.gnu.org/software/ocrad/ocrad.html
        Unpack and install it.

I realize this may not be all that straightforward for people who aren't
used to installing open source software.  Once you've done it a couple of
times, though, it gets easier.  Hopefully, we can get another SpamBayes
alpha release out in the next little while.  (Tony, if there's anything I
can do to help make this happen, let me know.)

Once you're ready to go, add the following to your SpamBayes options:

    x-lookup_ip: True
    lookup_ip_cache: ~/.dnscache

    x-image_size: True

    x-crack_images: True
    crack_image_cache: ~/.image_cache.pickle

The first group is unrelated to image spam, but I find it helps me a lot.
It maps hostnames to their IP addresses using DNS and generates tokens based
on those addresses.  The second generates tokens based on the size of
images.  The third enables text extraction from images (OCR, or optical
character recognition).  This is where PIL and Ocrad come in.
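
To illustrate what the first two groups of tokens might look like, here is
a rough sketch.  It is not the SpamBayes code.  The function names, the
token spellings ("url-ip:", "image-size:") and the power-of-two bucketing
are assumptions chosen for the example.

```python
import math
import socket


def ip_prefix_tokens(ip, prefix="url-ip"):
    """Yield one token per dotted-quad prefix, broadest first."""
    octets = ip.split(".")
    for n in range(1, len(octets) + 1):
        yield "%s:%s/%d" % (prefix, ".".join(octets[:n]), 8 * n)


def host_tokens(hostname, resolve=socket.gethostbyname):
    """Resolve a hostname via DNS and return tokens for its address.

    A failed lookup produces no tokens instead of raising.
    """
    try:
        ip = resolve(hostname)
    except OSError:
        return []
    return list(ip_prefix_tokens(ip))


def image_size_token(num_bytes):
    """Bucket an image's byte count to the nearest power of two."""
    if num_bytes <= 0:
        return "image-size:0"
    return "image-size:2**%d" % round(math.log2(num_bytes))


print(list(ip_prefix_tokens("192.0.2.7")))
print(image_size_token(5000))
```

Emitting a token for each address prefix lets the classifier learn about
whole network blocks as well as single hosts, and bucketing sizes keeps
near-identical images from producing unrelated tokens.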

I still get the occasional false negative on image spam, but it's definitely
manageable and should improve as Ocrad (itself still a very alpha piece of
software) improves.  Even though Ocrad does a poor job of text extraction
from a human comprehension standpoint, it produces tokens that SpamBayes
just loves, and it seems to generate enough unique tokens to tip the scales
on most image spam.
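
The general shape of such an OCR step can be sketched as follows, assuming
PIL is importable and an ocrad executable is on your PATH.  This is not the
SpamBayes implementation.  The function names and the "image-text:" token
prefix are hypothetical, and the sketch targets a current Python 3.

```python
import os
import subprocess
import tempfile


def image_to_text(image_path, ocrad="ocrad"):
    """Convert an image to PNM with PIL and pass it to ocrad."""
    from PIL import Image  # imported lazily so ocr_tokens works without PIL

    fd, pnm_path = tempfile.mkstemp(suffix=".pnm")
    os.close(fd)
    try:
        # Ocrad reads PNM input; a grayscale image is saved as a PGM file.
        Image.open(image_path).convert("L").save(pnm_path, format="PPM")
        result = subprocess.run([ocrad, pnm_path], capture_output=True,
                                text=True, check=False)
        return result.stdout
    finally:
        os.remove(pnm_path)


def ocr_tokens(text, min_len=3, max_len=12):
    """Split OCR output into lowercase tokens, garbled words included."""
    tokens = set()
    for word in text.lower().split():
        if min_len <= len(word) <= max_len:
            tokens.add("image-text:" + word)
    return tokens


# Misread words still become consistent tokens a classifier can learn.
print(sorted(ocr_tokens("V1AGRA ch3ap  a  Buy")))
```

The point of keeping garbled words is that the classifier only needs tokens
that recur across spam images, not text a person could read.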

Skip
_______________________________________________
[email protected]
http://mail.python.org/mailman/listinfo/spambayes
Check the FAQ before asking: http://spambayes.sf.net/faq.html
