Kenneth Porter wrote:
> <http://developers.slashdot.org/comments.pl?sid=195752&cid=16041870>
> 
> Theo just mentioned this on the -devel list:
> 
> <http://article.gmane.org/gmane.mail.spam.spamassassin.devel/45374>

I posted about this last week on the Devel-Spam and Maia-Users lists,
along with the results of some preliminary tests I conducted with
Tesseract OCR vs. GOCR, and it looks promising.  Here's what I posted:

=== post begins ===

It's already "usable"; I've compiled it and done some basic tests with
it, and it does seem to work pretty well.  On an arbitrary spam image,
for instance, which starts with:

  CRITICAL INVESTOR ALERT!
  ESPION INTERNATIONAL INC (EPLJ.PK)

The Tesseract OCR engine scanned this image and produced:

  CRITICAL INVESTOR ALERT!
  ESPION INTERNATIONAL INC (EPLJPK)

By comparison, the GOCR engine (with default options) produces:

  cRITIcAE INvEsToR AEERT!.

With -l 180 -d 2 though, GOCR does about as well as Tesseract, if you
ignore case:

  cRITIcAL INvEsToR ALERT!.
  EspIoN INTERNATIoNAL INc (EpLJ.pK)
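
For anyone who wants to put a rough number on "about as well ... if you
ignore case", here is a minimal sketch using Python's standard difflib.
The similarity() helper and the constant names are my own illustration
and are not part of either OCR engine. The strings are the outputs
quoted above with the leading indentation removed, and a
character-level ratio is only one of several reasonable ways to score
OCR output.

```python
from difflib import SequenceMatcher


def similarity(expected, actual, ignore_case=False):
    """Return a 0..1 character-level similarity between two strings."""
    if ignore_case:
        expected, actual = expected.lower(), actual.lower()
    return SequenceMatcher(None, expected, actual).ratio()


# Text visible in the sample image, and the two full outputs quoted above.
EXPECTED = "CRITICAL INVESTOR ALERT!\nESPION INTERNATIONAL INC (EPLJ.PK)"
TESSERACT = "CRITICAL INVESTOR ALERT!\nESPION INTERNATIONAL INC (EPLJPK)"
GOCR_TUNED = "cRITIcAL INvEsToR ALERT!.\nEspIoN INTERNATIoNAL INc (EpLJ.pK)"

for name, text in (("tesseract", TESSERACT), ("gocr -l 180 -d 2", GOCR_TUNED)):
    exact = similarity(EXPECTED, text)
    folded = similarity(EXPECTED, text, ignore_case=True)
    print(f"{name}: case-sensitive={exact:.3f} case-insensitive={folded:.3f}")
```

Since spam keyword matching is usually case-insensitive anyway, the
case-insensitive figure is probably the more relevant of the two.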

One potential snag, though, is that Tesseract OCR operates only on TIFF
images, and pnmtotiff wasn't able to produce a usable TIFF in this
sample test.  The ImageMagick "convert" utility worked, though.

Another issue is that Tesseract OCR doesn't behave as a filter (i.e. it
doesn't read from STDIN or write to STDOUT); instead, it expects to be
called like this:

  tesseract <image.tif> <outfile> batch

which then produces three files:

  outfile.map
  outfile.txt
  outfile.raw

The *.txt file is the extracted text that we're interested in.  It
shouldn't be too difficult to modify the sources to make Tesseract OCR
behave like a proper filter though, and it's likely that such a patch
would be welcomed by the maintainers at Google.

At the moment--and based solely on this very basic bit of testing--I'd
say that Tesseract OCR is comparable to GOCR, perhaps a bit more clever.
More testing on different types of spam images will be required to know
for sure.  There's practically no documentation available for it yet,
apart from the brief README, so it's unclear whether it accepts any
parameters of the sort that GOCR does, or whether it scans everything
with fixed settings or uses some sort of adaptive setting mechanism.

Google seems to be in the process of hiring people to work on it,
however, and with their large-scale book-scanning projects underway they
have a vested interest in producing a high-quality OCR engine, so in six
months or so Tesseract OCR might well be the engine of choice.  Right
now, though, I think GOCR can do a comparable job.

=== end of post ===

I discussed this with decoder (the author of the FuzzyOcr plugin), and
he maintains that none of these potential snags is a real obstacle.  His
next release will supposedly include a new temporary file structure that
can handle the *.txt file output without requiring the OCR engine to
behave as a proper filter.  I intend to continue testing Tesseract OCR
with more image spam samples as I get hold of them, but thus far it
looks promising.
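
To illustrate the temporary-file idea, here is a minimal Python sketch
of a wrapper that makes the engine look like a filter from the caller's
side.  This is NOT decoder's FuzzyOcr code, which I haven't seen; the
ocr_image() name, the default ".gif" suffix and the injectable "run"
parameter are my own assumptions.  It assumes ImageMagick's "convert"
and the "tesseract" binary are on the PATH, and it uses the invocation
form quoted in my post above, which other Tesseract releases may not
accept.

```python
import subprocess
import tempfile
from pathlib import Path


def ocr_image(image_bytes, suffix=".gif", run=subprocess.run):
    """Return the text extracted from image_bytes via temporary files.

    suffix is the (assumed) format of the incoming image; run is
    injectable so the function can be exercised without the real tools.
    """
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        src = workdir / ("input" + suffix)
        tif = workdir / "input.tif"
        base = workdir / "outfile"
        src.write_bytes(image_bytes)

        # ImageMagick chooses the output format from the .tif extension.
        run(["convert", str(src), str(tif)], check=True)

        # Form quoted in the post: tesseract <image.tif> <outfile> batch
        run(["tesseract", str(tif), str(base), "batch"], check=True)

        # Only outfile.txt is of interest; the other output files are
        # discarded along with the temporary directory.
        return base.with_suffix(".txt").read_text(errors="replace")
```

Feeding it sys.stdin.buffer.read() and writing the result to sys.stdout
gives filter-like behaviour without patching the Tesseract sources, and
using one temporary directory per image keeps concurrent scans from
overwriting each other's output files.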

The only /non-technical/ issue that occurs to me is in the licensing,
which is a combination of the Apache License (2.0) and a custom clause
that may be a non-starter for some applications: "If you wish to use it
for commercial gain you must contact The MITRE Corporation for
conditions of use."  This might preclude its (free) use in
SpamAssassin-based appliances, or at for-profit ISPs and offsite
mail-filtering services.  If the commercial license fees (or "conditions
of use") are reasonable, however, this may be a minor issue,
particularly if the engine proves to be better than anything else in its
price range.  Always nice to have options, even if they're not always free.

-- 
Robert LeBlanc <[EMAIL PROTECTED]>
Renaissoft, Inc.
Maia Mailguard <http://www.maiamailguard.com/>
