Re: [Spambayes] Analyzing text in image spam (was: Spam in Images)

Fast Turtle Fri, 18 Aug 2006 10:20:45 -0700

On Friday 18 August 2006 06:32, [EMAIL PROTECTED] wrote:
> >>>>> "skip" == skip  <[EMAIL PROTECTED]> writes:
>
>     Alice> .... i notice that many html code of image has the same format
>     Alice> like <IMG ALT="" border="0"
>     Alice> SRC="cid:[email protected]"> what the cid here mean?
>     Alice> does it valueable for recognize this spam?
>
>     skip> It just identifies an image that is delivered along with the
>     skip> message.  By itself it doesn't mean a lot.
>
> I should have given a bit more complete answer based on your message's more
> general point.  I recently added a fair amount of code to SpamBayes to
> "crack" the content of images.  The new code works very well for me.  If
> you'd like to try it, here's what you'll need to do:
>
>     1. Check out the latest source from the CVS repository.  (There's been
>        no new release since my recent checkins.)  Install it.
>
>     2. Install the Python Imaging Library:
>            http://www.pythonware.com/products/pil/
>
>     3a. (Windows) Grab the ocrad-cygwin package from the
>        SpamBayes Files page:
>            http://sourceforge.net/project/showfiles.php?group_id=61702
>        Unpack the zip file and copy ocrad.exe somewhere on your PATH.
>
>     3b. (Unix/Linux/Mac) Grab the ocrad source distribution from its web
>         site:
>             http://www.gnu.org/software/ocrad/ocrad.html
>         Unpack and install it.
>
> I realize this may not be all that straightforward for people who are
> unused to installing open source software.  Once you've done it a couple
> times though, it gets easier.  Hopefully, we can get another SpamBayes
> alpha release out in the next little while.  (Tony, if there's anything I
> can do to help make this happen, let me know.)
>
> Once you're ready to go, add the following to your SpamBayes options:
>
>     x-lookup_ip: True
>     lookup_ip_cache: ~/.dnscache
>
>     x-image_size: True
>
>     x-crack_images: True
>     crack_image_cache: ~/.image_cache.pickle
>
> The first group is unrelated to the image spam, but I find it helps me a
> lot.  It maps hostnames to their IP addresses using DNS and generates
> tokens based on those addresses.  The second records tokens about the size
> of images.  The third enables text extraction from images (OCR, or optical
> character recognition).  This is where PIL and Ocrad come in.
>
> I still get the occasional false negative on image spam, but it's
> definitely manageable and should improve as Ocrad (itself still a very
> alpha piece of software) improves.  Even though Ocrad does a poor job of
> text extraction from a human comprehension standpoint, it generates tokens
> that SpamBayes just loves and seems to generate enough unique tokens to tip
> the scales on most image spam.
>
> Skip
> _______________________________________________
> [email protected]
> http://mail.python.org/mailman/listinfo/spambayes
> Check the FAQ before asking: http://spambayes.sf.net/faq.html
Just had an idea that you may want to think about.


If a message has an HTML part with a CID that references a proper domain
name or IP address, why not add an option to tag that as high spam
probability unless it's on a whitelist?

What I'm thinking is that most clients that can handle HTML mail now include
the option to not load images from the web. Personally, I prefer and pretty
much strictly use plain text, and the only HTML-format mail I even consider
legit is whitelisted. I suspect many of us have the same belief.
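
A minimal sketch of that idea, assuming Python's standard email package and
nothing from SpamBayes itself. The names cid_hosts, looks_like_host and
suspicious_cid_hosts are hypothetical, as is the whitelist argument. Note
that a cid: URL points at a part carried inside the same message (RFC 2392)
and the text after the "@" is only an identifier, so a match is probably
better treated as one more clue for the classifier than as a verdict.

```python
import ipaddress
import re
from email import message_from_string

# Hypothetical helpers for illustration; this is not SpamBayes code.
CID_RE = re.compile(r'''src\s*=\s*["']?cid:([^"'\s>]+)''', re.IGNORECASE)
DOMAIN_RE = re.compile(
    r'^(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z]{2,}$', re.IGNORECASE)


def cid_hosts(msg):
    """Yield the text after '@' in every cid: image reference in HTML parts."""
    for part in msg.walk():
        if part.get_content_type() != "text/html":
            continue
        payload = part.get_payload(decode=True) or b""
        charset = part.get_content_charset() or "latin-1"
        try:
            html = payload.decode(charset, "replace")
        except LookupError:
            # Unknown charset name: fall back to a decoding that cannot fail.
            html = payload.decode("latin-1")
        for cid in CID_RE.findall(html):
            if "@" in cid:
                yield cid.rsplit("@", 1)[1].lower()


def looks_like_host(host):
    """True if host is syntactically an IP address or a dotted domain name."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return DOMAIN_RE.match(host) is not None


def suspicious_cid_hosts(msg, whitelist=frozenset()):
    """Return the sorted host-like cid suffixes that are not whitelisted."""
    return sorted({h for h in cid_hosts(msg)
                   if looks_like_host(h) and h not in whitelist})
```

Calling suspicious_cid_hosts(message_from_string(raw_text), whitelist) on a
message returns the list of non-whitelisted hosts; a tokenizer could emit one
token per entry and let training decide how much weight it deserves.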


