Re: Image pre-processing for good OCR results

TP Wed, 23 Feb 2011 21:36:16 -0800

On Sun, Feb 20, 2011 at 6:02 PM, Jon Andersen <jande...@gmail.com> wrote:
> Hi,
> My project at http://RecordAGrave.com is about recording headstones from
> graves and posting the text and images on the Net so that people can
> research their family history.  I would appreciate some advice on how to
> pre-process these headstone images to get the best results from Tesseract
> OCR.  I have thousands of 1-2 MB jpg images of headstones to process.
> Example images:
> http://freepages.genealogy.rootsweb.ancestry.com/~janderse/cemeteries/Star%20of%20David%20Memorial%20Gardens/Garden%20of%20Haifa%20-%20Raw/IMG_28215.jpg
> http://freepages.genealogy.rootsweb.ancestry.com/~janderse/cemeteries/Star%20of%20David%20Memorial%20Gardens/Garden%20of%20Haifa%20-%20Raw/IMG_28216.jpg
> http://freepages.genealogy.rootsweb.ancestry.com/~janderse/cemeteries/Star%20of%20David%20Memorial%20Gardens/Garden%20of%20Haifa%20-%20Raw/IMG_28217.jpg
> I am a software developer so I can script up pre-processing steps to prepare
> the input for Tesseract.
> Any advice on improving OCR accuracy through pre-processing steps?
> Thanks so much,
>
> -Jon
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To post to this group, send email to tesseract-ocr@googlegroups.com.
> To unsubscribe from this group, send email to
> tesseract-ocr+unsubscr...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en.
>


I'm a bit surprised that no one has yet mentioned that the Leptonica C
Image Processing Library (http://www.leptonica.com) is now required to
build tesseract-ocr -- or soon will be... the current state of
tesseract-ocr is a bit hazy. My understanding is that eventually
(though not in the near future) tesseract-ocr will use only Leptonica
PIXs as its in-memory image representation.

A still-unofficial, easier-to-read, Sphinx-generated version of the
Leptonica documentation is at
http://tpgit.github.com/UnOfficialLeptDocs/. Dan is currently
hammering away at v1.68 and it should be out soon (this week?). Once
it is released, I'll also update my unofficial version of the
documentation.

My admittedly quick/biased opinion is that OpenCV focuses on Computer
Vision, whereas Leptonica has more "pure" Image Processing routines. I
also find Leptonica's source code fairly easy to read because one of
the purposes of the library is to try to teach image processing
concepts.

In any case, if you're planning on using tesseract-ocr 3.x, then you
must already have liblept, so you might as well try it out.

-- TP

