On 30/10/14 08:47 PM, Gary Roach wrote:
Hi all,
Problem:
I am working on an archiving project and want to archive documents
as searchable PDF files, but I can't figure out how to proofread
and correct the text overlay. Any suggestions?
System:
Debian Wheezy
Intel i5-750 processor
HP Officejet Pro 8600 wireless all in one printer/fax/scanner
gscan2pdf software with Tesseract OCR
300 to 600 dpi scans.
Tesseract seems to do a really great job, but I have no good way of
verifying this or correcting any mistakes. Some of the documents are 100
years old and may not be in such great shape. I can always retype
everything but would like to avoid this as much as possible, for
obvious reasons.
Gary R.
Tesseract is the tool for the job. Scan at 600 dpi for best results. If
the originals are typed or typeset, the results should be good, but you
may have to do some fiddling with the scans to bring out the detail.
The fastest way to proofread is to pull the text into a word processor
and run the spell checker. A grammar check also helps.
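
If you would rather script that first pass than open every page in a word
processor, something along these lines might do as a rough sketch. It
assumes the OCR text has already been exported to a plain string and that a
word list such as /usr/share/dict/words is installed; the function names
are made up for illustration and are not part of Tesseract or gscan2pdf.

```python
import re
from pathlib import Path


def load_wordlist(path="/usr/share/dict/words"):
    """Return a lowercase set of known words (empty if the file is missing).

    The default path is an assumption; it is where Debian's wordlist
    packages usually put their list.
    """
    p = Path(path)
    if not p.is_file():
        return set()
    text = p.read_text(encoding="utf-8", errors="ignore")
    return {w.strip().lower() for w in text.splitlines() if w.strip()}


def suspect_words(text, known):
    """Yield (line_number, token) for tokens that are not known words.

    Tokens that are purely digits (years, page numbers) are skipped, but
    tokens mixing letters and digits are kept, since those are a typical
    OCR slip.
    """
    for lineno, line in enumerate(text.splitlines(), start=1):
        for token in re.findall(r"[A-Za-z0-9']+", line):
            word = token.strip("'").lower()
            if not word or word.isdigit():
                continue
            if word not in known:
                yield lineno, token


if __name__ == "__main__":
    sample = "The qu1ck brown fox\nin 1914 the quiek fox"
    for lineno, token in suspect_words(sample, load_wordlist()):
        print(f"line {lineno}: {token}")
```

This only flags candidates for a human to look at. Proper names and old
spellings in 100-year-old documents will show up as false alarms.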
There are also Tesseract box editors you can try that let you edit the
Tesseract OCR files.
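
For what it's worth, the box files those editors work on are plain text, so
small fixes can also be scripted. The sketch below assumes the usual layout
of one symbol per line, written as "symbol left bottom right top page" with
pixel coordinates; check a file from your own Tesseract version before
relying on it. The Box class and helper names are hypothetical, not a
Tesseract API.

```python
from dataclasses import dataclass


@dataclass
class Box:
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line):
    """Parse one box-file line, assuming 'symbol left bottom right top page'."""
    # Split from the right so the symbol itself is kept intact.
    symbol, *nums = line.rstrip("\n").rsplit(" ", 5)
    left, bottom, right, top, page = (int(n) for n in nums)
    return Box(symbol, left, bottom, right, top, page)


def format_box_line(box):
    """Write a Box back out in the same assumed layout."""
    return f"{box.symbol} {box.left} {box.bottom} {box.right} {box.top} {box.page}"


def replace_symbol(lines, index, new_symbol):
    """Return a copy of lines with the symbol on line index (0-based) replaced."""
    out = list(lines)
    box = parse_box_line(out[index])
    box.symbol = new_symbol
    out[index] = format_box_line(box)
    return out


if __name__ == "__main__":
    lines = ["c 1 2 3 4 0", "1 5 6 7 8 0"]
    # Correct a digit one that should have been a lowercase L.
    print(replace_symbol(lines, 1, "l"))
```

A line with fewer than six fields raises ValueError here, which is
deliberate: better to stop than to rewrite a file in a format the script
does not understand.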
I thought Tesseract would let you adjust the search words when necessary.