Re: proofing searchable pdf files

Gary Roach Fri, 31 Oct 2014 15:30:07 -0700

On 10/30/2014 05:47 PM, Gary Roach wrote:

Hi all,
Problem:
I am working on an archiving project and wish to archive documents to searchable pdf files but can't seem to figure out how to proofread and correct the text overlay. Any suggestions?
Tesseract seems to do a really great job but I have no good way of proving this or correcting any mistakes. Some of the documents are 100 years old and may not be in such great shape. I can always retype everything but would like to avoid this as much as possible, for obvious reasons.
Gary R.

OK more detail.

First, searchable pdf files are two-layer files, with the pdf vector graphics layer overlaying a text file. I have tried gimp but have not been able to separate the layers. Tesseract will show the text file, but in box format. This seems to be Tesseract's native file structure (guessing) and is virtually unusable for proofreading. I have been able to use Dolphin and Okular to get rid of the boxes, but Okular just replaces them with long strings of dots, which is also unusable for proofreading.
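
One thing that might get the text layer out as ordinary plain text is pdftotext from poppler-utils. This is only a minimal Python sketch, assuming poppler-utils is installed; the helper names and the file name scan_0001.pdf are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def text_layer_command(pdf_path, txt_path=None):
    """Build the pdftotext command that dumps a PDF's text layer to a file.

    Hypothetical helper: the output name defaults to the PDF name with .txt.
    """
    pdf = Path(pdf_path)
    out = Path(txt_path) if txt_path else pdf.with_suffix(".txt")
    # -layout asks pdftotext to keep the physical page layout as far as it can
    return ["pdftotext", "-layout", str(pdf), str(out)]


def extract_text_layer(pdf_path, txt_path=None):
    """Run pdftotext if it is installed; return the output path, else None."""
    cmd = text_layer_command(pdf_path, txt_path)
    if shutil.which(cmd[0]) is None:
        return None  # poppler-utils does not appear to be installed
    subprocess.run(cmd, check=True)
    return Path(cmd[-1])


if __name__ == "__main__":
    # Only prints the command; call extract_text_layer() to actually run it.
    print(" ".join(text_layer_command("scan_0001.pdf")))
```

This only covers reading the text layer for proofing. It does not put corrected text back into the pdf.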


Importing the pdf file into LibreOffice Writer produces garbage.

This is part of a medium-sized, low-budget archiving project that will process several thousand documents, all done by low-tech volunteers. So I really need methods that are straightforward or can be automated to the idiot level. A method that will split the vector graphics and text files apart, allow editing of the text file and reassembling of the file is needed. I am having trouble believing that there isn't software out there that will do this, but I have not been able to find it.
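
To make the requirement concrete, something along these lines is the kind of automation meant. It is a rough Python sketch, assuming the OCR step is run with Tesseract's hocr output (for example "tesseract page.tif page hocr") and that the resulting file carries per-word x_wconf confidence values. The class name, function name, threshold and sample text are all made up for illustration. The idea is to list only the doubtful words so a volunteer checks those against the scan instead of rereading everything.

```python
import re
from html.parser import HTMLParser


class LowConfidenceWords(HTMLParser):
    """Collect hOCR words whose x_wconf value is below a threshold."""

    def __init__(self, threshold=80):
        super().__init__()
        self.threshold = threshold
        self.flagged = []       # (word, confidence) pairs to proofread
        self._in_word = False   # True while inside an ocrx_word span
        self._conf = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = re.search(r"x_wconf\s+(\d+(?:\.\d+)?)", attrs.get("title") or "")
            self._conf = float(match.group(1)) if match else None
            self._text = []
            self._in_word = True

    def handle_data(self, data):
        if self._in_word:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._in_word:
            word = "".join(self._text).strip()
            if word and self._conf is not None and self._conf < self.threshold:
                self.flagged.append((word, self._conf))
            self._in_word = False


def flag_words(hocr_text, threshold=80):
    """Return the (word, confidence) pairs that fall below the threshold."""
    parser = LowConfidenceWords(threshold)
    parser.feed(hocr_text)
    parser.close()
    return parser.flagged


# Made-up sample in the shape of hOCR word spans, not real project data.
SAMPLE = """
<span class='ocr_line' title='bbox 10 10 400 40'>
<span class='ocrx_word' title='bbox 10 10 80 40; x_wconf 95'>Deed</span>
<span class='ocrx_word' title='bbox 90 10 160 40; x_wconf 42'>0f</span>
<span class='ocrx_word' title='bbox 170 10 260 40; x_wconf 91'>Trust</span>
</span>
"""

if __name__ == "__main__":
    for word, conf in flag_words(SAMPLE):
        print(f"check: {word!r} (confidence {conf:.0f})")
```

A low confidence value does not catch every mistake, so this would narrow the proofreading rather than replace it. Putting the corrected text back together with the page image would still need a separate tool.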

Your comments so far have pointed me in several different directions but I still haven't found an efficient (or even viable) editing method.


Your help is really appreciated.

Gary R.


