On 10/30/2014 05:47 PM, Gary Roach wrote:
Hi all,
Problem:
I am working on an archiving project and wish to archive documents
as searchable PDF files, but I can't figure out how to proofread
and correct the text overlay. Any suggestions?
Tesseract seems to do a really great job, but I have no good way of
verifying this or correcting any mistakes. Some of the documents are 100
years old and may not be in such great shape. I can always retype
everything but would like to avoid this, as much as possible, for
obvious reasons.
Gary R.
OK, more detail.
First, a searchable PDF is a two-layer file, with the PDF graphics
layer overlaying a text layer. I have tried GIMP but have not been able
to separate the layers. Tesseract will show the text but in box format.
This seems to be Tesseract's native file structure (guessing) and is
virtually unusable for proofreading. I have been able to use Dolphin
and Okular to get rid of the boxes, but Okular just replaces them with
long strings of dots, which is also unusable for proofreading.
Transferring the PDF file to LibreOffice Writer produces garbage.
This is part of a medium-sized, low-budget archiving project that will
process several thousand documents, all done by low-tech volunteers. So
I really need methods that are straightforward or can be automated to
the idiot level. What I need is a method that splits the graphics and
text layers apart, allows editing of the text, and then reassembles the
file. I am having trouble believing that there isn't software out
there that will do this, but I have not been able to find it.
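
For illustration only, here is a minimal Python sketch of the
proofreading half of such a workflow. It assumes the OCR text is
available as hOCR (Tesseract can write this with something like
"tesseract page.tif page hocr"), where each word is a span of class
ocrx_word whose title carries a bbox and an x_wconf confidence. The
names HocrWordParser, words_to_review, SAMPLE and the threshold of 80
are hypothetical, and the sample markup is made up. It only lists the
words a volunteer should check; rebuilding the PDF from the corrected
hOCR would be a separate step.

```python
import re
from html.parser import HTMLParser


class HocrWordParser(HTMLParser):
    """Collect text, bounding box and confidence for each ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None  # the word span currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            title = attrs.get("title") or ""
            bbox = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", title)
            conf = re.search(r"x_wconf (\d+(?:\.\d+)?)", title)
            self._current = {
                "text": "",
                "bbox": tuple(int(n) for n in bbox.groups()) if bbox else None,
                "conf": float(conf.group(1)) if conf else None,
            }

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        # Assumes word spans contain no nested spans (a simplification).
        if tag == "span" and self._current is not None:
            self.words.append(self._current)
            self._current = None


def words_to_review(hocr_text, threshold=80.0):
    """Return words whose confidence is missing or below the threshold."""
    parser = HocrWordParser()
    parser.feed(hocr_text)
    parser.close()
    return [w for w in parser.words
            if w["conf"] is None or w["conf"] < threshold]


# Made-up hOCR fragment, standing in for real Tesseract output.
SAMPLE = """
<div class='ocr_page' title='bbox 0 0 600 800'>
 <span class='ocr_line' title='bbox 36 92 300 116'>
  <span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Archive</span>
  <span class='ocrx_word' title='bbox 110 92 180 116; x_wconf 41'>rec0rds</span>
 </span>
</div>
"""

if __name__ == "__main__":
    for word in words_to_review(SAMPLE):
        print(word["text"], word["conf"], word["bbox"])
```

The bounding box is kept so that a flagged word can be located on the
page image when a volunteer checks it against the scan.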
Your comments so far have pointed me in several different directions but
I still haven't found an efficient (or even viable) editing method.
Your help is really appreciated.
Gary R.