I also use tesseract.  Ubuntu provides several OCR programs; in about
2007-9 I tried three of them (gocr, ocrad and tesseract) and the last
was by far the most capable at the fundamental task of recognising
characters.  I recently installed the Ubuntu package tesseract-ocr
2.04-2 and it seems to have improved significantly in the last few
years.
Tesseract only understands tiff images. If you are scanning locally, set
scanimage to write tiff rather than pgm, and preprocess the image
(e.g. with imagemagick) as little as you can.  The program will read
lines of text that depart appreciably from true horizontal, but it can
balk at files that have been much altered from the original scan.  It
has no option to select one column from a multicolumn page, so some
ingenuity may be needed (like masking the scanner).
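
As a rough sketch of the scan-then-recognise sequence described above,
the two commands could be driven from Python as below.  The function
names, the file names page.tif and page, and the 300 dpi default are
illustrative assumptions, not anything tesseract requires.  The
--resolution option depends on the scanner backend and may not exist on
yours, so check "scanimage --help" first.

```python
import subprocess


def scan_command(resolution_dpi=300):
    """Build a scanimage command that writes a TIFF image to stdout.

    --format=tiff is a standard scanimage option; --resolution is
    backend-specific and is assumed here to be supported.
    """
    return ["scanimage", "--format=tiff", "--resolution", str(resolution_dpi)]


def ocr_command(tiff_path, output_base, language="eng"):
    """Build a tesseract command: tesseract IMAGE OUTPUTBASE -l LANG.

    tesseract appends ".txt" to OUTPUTBASE itself.
    """
    return ["tesseract", tiff_path, output_base, "-l", language]


def scan_and_ocr(tiff_path="page.tif", output_base="page"):
    """Scan one page straight to TIFF (no preprocessing), then OCR it."""
    with open(tiff_path, "wb") as tiff_file:
        subprocess.run(scan_command(), stdout=tiff_file, check=True)
    subprocess.run(ocr_command(tiff_path, output_base), check=True)
    return output_base + ".txt"
```

Calling scan_and_ocr() should leave the recognised text in page.txt,
provided a scanner is attached and both programs are installed.
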
It relies heavily on understanding the language of the document and I've
not tried it with any language other than English.  At the post-editing
stage do check numerals carefully.
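
One way to narrow down the numeral check is to list words that mix
letters and digits, since an O read for 0 or an l read for 1 often
shows up that way.  This is only a sketch: the pattern and the function
name suspect_tokens are my own assumptions, it will also flag
legitimate words such as "2nd", and it cannot spot a wrong digit inside
an all-digit number, so it does not replace reading the numbers against
the original.

```python
import re

# A word containing at least one digit and at least one ASCII letter.
SUSPECT = re.compile(r"\b(?=\w*\d)(?=\w*[A-Za-z])\w+\b")


def suspect_tokens(text):
    """Return (line_number, token) pairs for words mixing letters and digits."""
    found = []
    for line_number, line in enumerate(text.splitlines(), 1):
        for match in SUSPECT.finditer(line):
            found.append((line_number, match.group(0)))
    return found
```

Run over the text file that tesseract wrote, this gives a short list of
places to look at first.
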
Good luck anyway.
Regards, John

-- 
John Palmer
Preston near Weymouth, Dorset, England
e-mail:  jo...@bcs.org.uk (plain text preferred)
website: http://www.palmyra.me.uk/


--
Next meeting:  Bournemouth, Tuesday 2011-08-02 20:00
Meets, Mailing list, IRC, LinkedIn, ...  http://dorset.lug.org.uk/
How to Report Bugs Effectively:  http://goo.gl/4Xue
