Kenneth Porter wrote:
> <http://developers.slashdot.org/comments.pl?sid=195752&cid=16041870>
>
> Theo just mentioned this on the -devel list:
>
> <http://article.gmane.org/gmane.mail.spam.spamassassin.devel/45374>

I posted about this last week on the Devel-Spam and Maia-Users lists, along with the results of some preliminary tests I conducted with Tesseract OCR vs. GOCR, and it looks promising. Here's what I posted:

=== post begins ===

It's already "usable"; I've compiled it and done some basic tests with it, and it does seem to work pretty well. On an arbitrary spam image, for instance, which starts with:

    CRITICAL INVESTOR ALERT! ESPION INTERNATIONAL INC (EPLJ.PK)

The Tesseract OCR engine scanned this image and produced:

    CRITICAL INVESTOR ALERT! ESPION INTERNATIONAL INC (EPLJPK)

By comparison, the GOCR engine (with default options) produces:

    cRITIcAE INvEsToR AEERT!.

With -l 180 -d 2 though, GOCR does about as well as Tesseract, if you ignore case:

    cRITIcAL INvEsToR ALERT!. EspIoN INTERNATIoNAL INc (EpLJ.pK)

One potential snag, though, is that Tesseract OCR only operates on TIFF images, and pnmtotiff wasn't able to produce a usable TIFF in this sample test. The ImageMagick "convert" utility worked, though.

Another issue is that Tesseract OCR doesn't behave as a filter (i.e. it doesn't read from STDIN or write to STDOUT), it expects to be called like this:

    tesseract <image.tif> <outfile> batch

which then produces three files:

    outfile.map
    outfile.txt
    outfile.raw

The *.txt file is the extracted text that we're interested in. It shouldn't be too difficult to modify the sources to make Tesseract OCR behave like a proper filter though, and it's likely that such a patch would be welcomed by the maintainers at Google.

At the moment--and based solely on this very basic bit of testing--I'd say that Tesseract OCR is comparable to GOCR, perhaps a bit more clever. More testing on different types of spam images will be required to know for sure.
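
As a rough illustration of the calling convention described above (image file in, outfile.txt out), here is a minimal Python sketch of a wrapper that makes the engine look like a filter from the caller's side. It only assumes the two command lines mentioned in this post, ImageMagick's "convert" and "tesseract <image.tif> <outfile> batch"; the names build_commands, ocr_filter, page.tif and the "run" parameter are made up for the example, and other Tesseract releases may expect a different command line.

```python
import subprocess
import tempfile
from pathlib import Path


def build_commands(src, workdir):
    """Return the convert command, the tesseract command, and the expected text file."""
    workdir = Path(workdir)
    tiff = workdir / "page.tif"      # hypothetical intermediate file name
    base = workdir / "outfile"       # tesseract appends .txt/.map/.raw to this
    # ImageMagick "convert" produced a usable TIFF in the test described above.
    convert_cmd = ["convert", str(src), str(tiff)]
    # Invocation as given in the post: tesseract <image.tif> <outfile> batch
    ocr_cmd = ["tesseract", str(tiff), str(base), "batch"]
    return convert_cmd, ocr_cmd, Path(str(base) + ".txt")


def ocr_filter(image_bytes, run=subprocess.run):
    """Take raw image bytes, return the extracted text, hiding the temp files."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "input"
        src.write_bytes(image_bytes)
        convert_cmd, ocr_cmd, txt_path = build_commands(src, tmp)
        run(convert_cmd, check=True)   # image -> TIFF
        run(ocr_cmd, check=True)       # TIFF -> outfile.txt (plus .map and .raw)
        # Only the *.txt file matters; the rest goes away with the temp dir.
        return txt_path.read_text(errors="replace")
```

A caller would read the image from STDIN and print the result, e.g. sys.stdout.write(ocr_filter(sys.stdin.buffer.read())). The "run" parameter exists only so the external programs can be swapped out when neither tool is installed.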

There's practically no documentation available for it yet, apart from the brief README, so it's unclear whether it accepts any parameters of the sort that GOCR does, or whether it scans everything with fixed settings or uses some sort of adaptive setting mechanism. Google seems to be in the process of hiring people to work on it, however, and with their large-scale book-scanning projects underway they have a vested interest in producing a high-quality OCR engine, so in six months or so Tesseract OCR might well be the engine of choice. Right now, though, I think GOCR can do a comparable job.

=== end of post ===

Upon discussing this with decoder (the author of the FuzzyOcr plugin), he maintains that none of these potential snags are real obstacles. His next release will supposedly include a new temporary file structure that can handle the *.txt file output without requiring the OCR engine to behave as a proper filter.

I intend to continue testing Tesseract OCR with more image spam samples as I get ahold of them, but thus far it looks promising. The only /non-technical/ issue that occurs to me is in the licensing, which is a combination of the Apache License (2.0) and a custom clause that may be a non-starter for some applications:

"If you wish to use it for commercial gain you must contact The MITRE Corporation for conditions of use."

This might preclude its (free) use in SpamAssassin-based appliances, or at for-profit ISPs and offsite mail-filtering services. If the commercial license fees (or "conditions of use") are reasonable, however, this may be a minor issue, particularly if the engine proves to be better than anything else in its price range.

Always nice to have options, even if they're not always free.

-- 
Robert LeBlanc <[EMAIL PROTECTED]>
Renaissoft, Inc.
Maia Mailguard <http://www.maiamailguard.com/>