On Friday 18 August 2006 06:32, [EMAIL PROTECTED] wrote:
> >>>>> "skip" == skip <[EMAIL PROTECTED]> writes:
>
>     Alice> .... i notice that many html code of image has the same format
>     Alice> like <IMG ALT="" border="0"
>     Alice> SRC="cid:[email protected]"> what the cid here mean?
>     Alice> does it valueable for recognize this spam?
>
>     skip> It just identifies an image that is delivered along with the
>     skip> message.  By itself it doesn't mean a lot.
>
> I should have given a bit more complete answer based on your message's more
> general point.  I recently added a fair amount of code to SpamBayes to
> "crack" the content of images.  The new code works very well for me.  If
> you'd like to try it, here's what you'll need to do:
>
>     1. Check out the latest source from the CVS repository.  (There's been
>        no new release since my recent checkins.)  Install it.
>
>     2. Install the Python Imaging Library:
>            http://www.pythonware.com/products/pil/
>
>     3a. (Windows) Grab the ocrad-cygwin package from the
>         SpamBayes Files page:
>            http://sourceforge.net/project/showfiles.php?group_id=61702
>         Unpack the zip file and copy ocrad.exe somewhere on your PATH.
>
>     3b. (Unix/Linux/Mac) Grab the ocrad source distribution from its web
>         site:
>            http://www.gnu.org/software/ocrad/ocrad.html
>         Unpack and install it.
>
> I realize this may not be all that straightforward for people who are
> unused to installing open source software.  Once you've done it a couple
> times though, it gets easier.  Hopefully, we can get another SpamBayes
> alpha release out in the next little while.  (Tony, if there's anything I
> can do to help make this happen, let me know.)
>
> Once you're ready to go, add the following to your SpamBayes options:
>
>     x-lookup_ip: True
>     lookup_ip_cache: ~/.dnscache
>
>     x-image_size: True
>
>     x-crack_images: True
>     crack_image_cache: ~/.image_cache.pickle
>
> The first group is unrelated to the image spam, but I find it helps me a
> lot.  It maps hostnames to their IP addresses using DNS and generates
> tokens based on those addresses.  The second records tokens about the size
> of images.  The third enables text extraction from images (OCR, or optical
> character recognition).  This is where PIL and Ocrad come in.
>
> I still get the occasional false negative on image spam, but it's
> definitely manageable and should improve as Ocrad (itself still a very
> alpha piece of software) improves.  Even though Ocrad does a poor job of
> text extraction from a human comprehension standpoint, it generates tokens
> that SpamBayes just loves and seems to generate enough unique tokens to tip
> the scales on most image spam.
>
> Skip
> _______________________________________________
> [email protected]
> http://mail.python.org/mailman/listinfo/spambayes
> Check the FAQ before asking: http://spambayes.sf.net/faq.html

Just had an idea that you may want to think about.
If a message has an HTML part with a CID that references a proper domain
name or IP address, why not add an option to tag that as high spam
probability unless it's on a whitelist?  What I'm thinking is that most
clients that can handle HTML mail now include an option not to load images
from the web.  Personally, I prefer and pretty much strictly use plain
text.  The only HTML-format mail I even consider legitimate is whitelisted,
and I suspect many of us feel the same way.
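
To make the idea concrete, here is a rough sketch of one way to read it: pull the "cid:" references out of a message's HTML parts, take whatever follows the "@", and emit a token when that looks like a hostname or IP address that is not whitelisted. This is not SpamBayes code. The function names, the "cid-host:" token prefix and the WHITELIST set are all made up for illustration, and it uses only the Python standard library.

```python
import ipaddress
import re
from email import message_from_string
from html.parser import HTMLParser

# Hypothetical whitelist; a real one would come from user configuration.
WHITELIST = {"example.org"}

# Loose hostname check: dot-separated labels ending in an alphabetic TLD.
_HOSTNAME = re.compile(
    r"^(?=.{1,253}$)([a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,}$", re.I
)


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def cid_host(src):
    """Return the part after '@' in a cid: URL, or None if there is none."""
    if not src.lower().startswith("cid:"):
        return None
    _, sep, host = src[4:].rpartition("@")
    return host.lower() if sep and host else None


def looks_like_host(host):
    """True if host parses as an IP address or matches the hostname pattern."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return bool(_HOSTNAME.match(host))


def cid_tokens(msg):
    """Return hypothetical 'cid-host:' tokens for non-whitelisted CID hosts."""
    tokens = []
    for part in msg.walk():
        if part.get_content_type() != "text/html":
            continue
        payload = part.get_payload(decode=True) or b""
        try:
            text = payload.decode(part.get_content_charset() or "latin-1", "replace")
        except LookupError:  # unknown charset name in the header
            text = payload.decode("latin-1", "replace")
        collector = ImgSrcCollector()
        collector.feed(text)
        for src in collector.sources:
            host = cid_host(src)
            if host and looks_like_host(host) and host not in WHITELIST:
                tokens.append("cid-host:" + host)
    return tokens


if __name__ == "__main__":
    raw = (
        "MIME-Version: 1.0\n"
        "Content-Type: text/html; charset=us-ascii\n"
        "\n"
        '<IMG ALT="" border="0" SRC="cid:part1.abc@mail.example.com">\n'
    )
    print(cid_tokens(message_from_string(raw)))
```

Whether such a token should count strongly toward spam would be left to training rather than hard-coded. Note also that a "cid:" reference normally points at a part attached to the same message, so the host-like text after the "@" is just an identifier and nothing is fetched from it.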
