I also use tesseract. Ubuntu provides several OCR programs; around 2007-9 I tried three of them (gocr, ocrad and tesseract) and the last was by far the most capable at the fundamental task of recognising characters. I recently installed [Ubuntu package] tesseract-ocr 2.04-2 and it seems that significant improvements have been made in the last few years.

Tesseract only understands TIFF images. If you are scanning locally, set scanimage to write TIFF rather than PGM, and avoid preprocessing the image (e.g. with ImageMagick) as much as you can. The program will read lines of text that depart appreciably from true horizontal, but it can balk at files that have been much altered from the original scan.

It has no option to select one column from a multicolumn page, so some ingenuity may be needed (like masking the scanner). It relies heavily on understanding the language of the document, and I've not tried it with any language other than English.

At the post-editing stage, do check numerals carefully.

Good luck anyway.

Regards,
John
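PS: for anyone who would rather script the steps above, here is a rough Python sketch, not something I have tested against every scanner. It builds a scanimage command that writes TIFF, hands the file to tesseract untouched, and then lists the lines of the result that contain digits so the numerals can be checked by eye. The function names, the file names page.tif and page, and the 300 dpi default are my own illustrative choices. The --resolution option depends on the scanner backend, so treat it as an assumption and check `scanimage --help` for your device.

```python
import re
import subprocess


def build_scan_command(resolution_dpi=300):
    """Return a scanimage command line that writes a TIFF image to stdout."""
    # --format=tiff asks scanimage for TIFF rather than its default PNM output.
    # --resolution is backend-dependent (assumption: your scanner supports it).
    return ["scanimage", "--format=tiff", "--resolution", str(resolution_dpi)]


def build_ocr_command(tiff_path, output_base, language="eng"):
    """Return a tesseract command line; tesseract adds '.txt' to output_base."""
    return ["tesseract", tiff_path, output_base, "-l", language]


def scan_and_ocr(tiff_path="page.tif", output_base="page"):
    """Scan one page to TIFF, OCR it with no preprocessing, return the text path."""
    with open(tiff_path, "wb") as tiff_file:
        subprocess.run(build_scan_command(), stdout=tiff_file, check=True)
    subprocess.run(build_ocr_command(tiff_path, output_base), check=True)
    return output_base + ".txt"


def lines_with_digits(text):
    """Return (line_number, line) pairs for every line containing a digit."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if re.search(r"\d", line)
    ]
```

Calling scan_and_ocr() needs a working scanner and tesseract installed, so it is left for you to invoke. lines_with_digits() only finds numerals that were recognised as digits; a figure misread entirely as letters will not be flagged, so it narrows the proofreading down but does not replace it.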
--
John Palmer
Preston near Weymouth, Dorset, England
e-mail: jo...@bcs.org.uk (plain text preferred)
website: http://www.palmyra.me.uk/

--
Next meeting: Bournemouth, Tuesday 2011-08-02 20:00
Meets, Mailing list, IRC, LinkedIn, ... http://dorset.lug.org.uk/
How to Report Bugs Effectively: http://goo.gl/4Xue