Re: [Spambayes] Analyzing text in image spam (was: Spam in Images)

Alan Arndt Fri, 18 Aug 2006 10:18:08 -0700

Skip,

That sounds great.  Thanks.  I don't know if I will take all the steps to
try and get it up and running or wait for a new release, but we really
appreciate it.


I did have a few questions.  How much time/processing does the OCR take?  I
would think that might be very intensive.  Not that most people don't have
the cycles to spare, or that it wouldn't be much faster than scanning the
spam myself, but I'm just curious.

Also, should one re-initialize the spam database?  Are the extracted tokens
treated just like any others once they are in the database, or are they
somehow grouped to relate to images?

Thanks,
Alan

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
Sent: Friday, August 18, 2006 6:33 AM
To: Alice Bryson <[EMAIL PROTECTED]>; [email protected]; Alan Arndt
Subject: Analyzing text in image spam (was: Spam in Images)

>>>>> "skip" == skip  <[EMAIL PROTECTED]> writes:

    Alice> .... I notice that the HTML code for many images has the same
    Alice> format, like <IMG ALT="" border="0"
    Alice> SRC="cid:[email protected]">. What does the cid here mean?
    Alice> Is it valuable for recognizing this spam?

    skip> It just identifies an image that is delivered along with the
    skip> message.  By itself it doesn't mean a lot.

I should have given a bit more complete answer based on your message's more
general point.  I recently added a fair amount of code to SpamBayes to
"crack" the content of images.  The new code works very well for me.  If
you'd like to try it, here's what you'll need to do:

    1. Check out the latest source from the CVS repository.  (There's been
       no new release since my recent checkins.)  Install it.

    2. Install the Python Imaging Library:
           http://www.pythonware.com/products/pil/

    3a. (Windows) Grab the ocrad-cygwin package from the
        SpamBayes Files page:
            http://sourceforge.net/project/showfiles.php?group_id=61702
        Unpack the zip file and copy ocrad.exe somewhere on your PATH.

    3b. (Unix/Linux/Mac) Grab the ocrad source distribution from its web
        site:
            http://www.gnu.org/software/ocrad/ocrad.html
        Unpack and install it.

I realize this may not be all that straightforward for people who are unused
to installing open source software.  Once you've done it a couple times
though, it gets easier.  Hopefully, we can get another SpamBayes alpha
release out in the next little while.  (Tony, if there's anything I can do
to help make this happen, let me know.)

Once you're ready to go, add the following to your SpamBayes options:

    x-lookup_ip: True
    lookup_ip_cache: ~/.dnscache

    x-image_size: True

    x-crack_images: True
    crack_image_cache: ~/.image_cache.pickle

The first group is unrelated to the image spam, but I find it helps me a
lot.  It maps hostnames to their IP addresses using DNS and generates tokens
based on those addresses.  The second records tokens about the size of
images.  The third enables text extraction from images (OCR, or optical
character recognition).  This is where PIL and Ocrad come in.
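
To make the first two groups a little more concrete, here is a rough sketch
of the kind of tokens they could produce.  This is not the SpamBayes code
itself.  The function names, the "url-ip" and "image-size" prefixes, and the
choice of /32, /24, /16 and /8 prefixes and power-of-two size buckets are all
assumptions made for illustration.

```python
import math
import socket


def ip_tokens(hostname, resolve=socket.gethostbyname, prefix="url-ip"):
    """Yield tokens for a hostname's IPv4 address at /32, /24, /16 and /8.

    `resolve` is injectable so the function can be exercised without DNS.
    The token format is hypothetical, not taken from SpamBayes.
    """
    try:
        ip = resolve(hostname)
    except OSError:
        # Unresolvable host: contribute no tokens at all.
        return
    quads = ip.split(".")
    yield "%s:%s/32" % (prefix, ip)
    for count, bits in ((3, 24), (2, 16), (1, 8)):
        yield "%s:%s/%d" % (prefix, ".".join(quads[:count]), bits)


def image_size_token(nbytes, prefix="image-size"):
    """Bucket an image's byte count to the nearest power of two."""
    if nbytes <= 0:
        return "%s:empty" % prefix
    return "%s:2**%d" % (prefix, round(math.log2(nbytes)))
```

The coarser prefixes let the classifier learn about a whole network block
even when a spammer rotates individual addresses, and bucketing sizes keeps
the number of distinct size tokens small.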

I still get the occasional false negative on image spam, but it's definitely
manageable and should improve as Ocrad (itself still a very alpha piece of
software) improves.  Even though Ocrad does a poor job of text extraction
from a human comprehension standpoint, it generates tokens that SpamBayes
just loves and seems to generate enough unique tokens to tip the scales on
most image spam.
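
For the curious, the general idea behind the third group can be sketched as
below.  Again, this is an illustration under assumptions, not the actual
SpamBayes implementation: the function names and the "image-text" prefix are
made up, the image is assumed to already be in PNM format (the conversion is
the sort of job PIL would do), and the cache is a plain dict keyed by a hash
of the image bytes so that the same image is only run through OCR once.

```python
import hashlib
import os
import subprocess
import tempfile


def ocrad_text(pnm_bytes):
    """Run the ocrad command on PNM image data and return its text output.

    Assumes an `ocrad` executable is on the PATH.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "image.pnm")
        with open(path, "wb") as f:
            f.write(pnm_bytes)
        done = subprocess.run(["ocrad", path], capture_output=True)
    # OCR output is often noisy; decode leniently rather than fail.
    return done.stdout.decode("latin-1", "replace")


def image_tokens(image_bytes, cache, ocr=ocrad_text, prefix="image-text"):
    """Return one token per whitespace-separated word of OCR output.

    `cache` maps a SHA-1 digest of the image to its OCR text, so repeated
    images skip the (comparatively slow) OCR step.
    """
    key = hashlib.sha1(image_bytes).hexdigest()
    if key not in cache:
        cache[key] = ocr(image_bytes)
    return ["%s:%s" % (prefix, word.lower()) for word in cache[key].split()]
```

Note that the tokens do not need to be readable words.  A garbled string
that shows up consistently in spam images is just as useful to the
classifier as a correctly recognized one.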

Skip
