On Feb 20, 9:02 pm, Jon Andersen <jande...@gmail.com> wrote:

> My project at http://RecordAGrave.com is about recording headstones from
> graves and posting the text and images on the Net so that people can
> research their family history.  I would appreciate some advice on how to
> pre-process these headstone images to get the best results from Tesseract
> OCR.  I have thousands of 1-2 MB jpg images of headstones to process.

Once the image has been captured, it's too late for one of the most
important enhancements, namely high-contrast lighting.  That's not
really an issue with stones that have painted carving or are otherwise
naturally high contrast, but for many stones sharp oblique lighting is
important to get an image that's readable by humans, let alone OCR
software.

Once you've got the best quality image capture you can manage, you'll
probably find that you need to use different image processing
pipelines for different types of stones and carving, so the first step
will be to categorize the stone and figure out which pipeline to run
it through (or run it through them all and compare the results).
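As a rough sketch of the "run it through them all and compare" option, something like the Python below might do.  Everything in it is an assumption for illustration: the pipelines and the OCR step are passed in as plain callables (nothing here calls Tesseract itself), and the scoring rule, the fraction of tokens that are numbers or come from a short list of expected headstone words, is just one placeholder way of comparing results.

```python
from typing import Callable, Dict, Tuple

# Assumption: a tiny, hypothetical list of words that headstones often carry.
EXPECTED_WORDS = {"IN", "LOVING", "MEMORY", "OF", "BORN", "DIED", "AGED", "YEARS"}


def vocabulary_score(text: str) -> float:
    """Return the fraction of tokens that are plain numbers or expected words."""
    tokens = [t.strip(".,;:") for t in text.upper().split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t.isdigit() or t in EXPECTED_WORDS)
    return hits / len(tokens)


def best_pipeline(
    image,
    pipelines: Dict[str, Callable],
    ocr: Callable[..., str],
    score: Callable[[str], float] = vocabulary_score,
) -> Tuple[str, str, float]:
    """Run every pipeline, OCR each result, and keep the highest-scoring text.

    `pipelines` maps a name to a preprocessing function and `ocr` turns a
    preprocessed image into text; both are supplied by the caller.
    """
    if not pipelines:
        raise ValueError("at least one pipeline is required")
    scored = []
    for name, preprocess in pipelines.items():
        text = ocr(preprocess(image))
        scored.append((score(text), name, text))
    best_score, best_name, best_text = max(scored, key=lambda item: item[0])
    return best_name, best_text, best_score
```

In practice `ocr` would wrap whatever way you invoke Tesseract, and each entry in `pipelines` would be one preprocessing recipe tuned for a type of stone.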

In addition to image processing, you may also be able to improve
results by making use of the fact that the vocabulary and layout of
headstone text are much more constrained than in free text.
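One way to exploit the constrained vocabulary is to post-correct the OCR output against a small lexicon.  The sketch below is only an illustration under stated assumptions: the word list and the 0.7 similarity cutoff are made up for the example, and it uses Python's standard difflib rather than anything built into Tesseract.

```python
import difflib
from typing import List

# Assumption: a small, hypothetical lexicon of common headstone words.
LEXICON: List[str] = [
    "IN", "LOVING", "MEMORY", "OF", "BORN", "DIED", "AGED", "YEARS",
    "JANUARY", "FEBRUARY", "MARCH", "APRIL", "MAY", "JUNE", "JULY",
    "AUGUST", "SEPTEMBER", "OCTOBER", "NOVEMBER", "DECEMBER",
]


def correct_token(token: str, cutoff: float = 0.7) -> str:
    """Snap a token to the closest lexicon word, or return it unchanged."""
    if token.isdigit():
        return token  # leave dates and ages alone
    upper = token.upper()
    if upper in LEXICON:
        return upper
    matches = difflib.get_close_matches(upper, LEXICON, n=1, cutoff=cutoff)
    # Names and other unknown words have no close match and pass through.
    return matches[0] if matches else token


def correct_line(line: str) -> str:
    """Apply correct_token to every whitespace-separated token in a line."""
    return " ".join(correct_token(t) for t in line.split())
```

For example, `correct_line("B0RN 12 JANUARV 1850")` gives `"BORN 12 JANUARY 1850"`, while a surname like "Smith" is left as it is because nothing in the lexicon is close enough.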

It'll be interesting to see what kind of results you get.  I suspect
it's going to be a fairly challenging project for the general case,
but you may be able to pick off the low-hanging fruit and gradually
expand the types of stones you can handle.

Tom
