Re: Merging pages of two PDF documents

Tilman Hausherr Tue, 27 Oct 2015 09:55:20 -0700

On 27.10.2015 at 10:12, Timm Friedholz wrote:

On 27.10.2015, at 06:51, Maruan Sahyoun <[email protected]> wrote:

Hi,

On 26.10.2015 at 22:46, Timm Friedholz <[email protected]> wrote:

Hello,

I have some PDF documents in which the glyph-Unicode character mapping is 
destroyed so that it's not possible to search and copy the text.  In an attempt 
to remove this restriction I've converted the PDFs to TIFF images and performed 
OCR on them using tesseract.  Tesseract exports the recognized text as PDF 
files in which the text is positioned transparently on top of the images such 
that the text is searchable and selectable.

The problem is that the images in the PDF that tesseract outputs are 
gray-scaled, large and high-contrast versions of the original PDFs, and I would 
like to have the quality and file size of the original PDFs instead.  Thus my 
idea is to copy the text objects of the OCR output to the original PDFs.  To 
avoid interference with the old text, I've converted the original PDFs to 
vector paths using Ghostscript:

gs -o out.pdf -dNoOutputFonts -sDEVICE=pdfwrite in.pdf

Now the problem is that I'm not sure how to approach this programmatically.  
Can I simply iterate over the pages and copy the text objects from each page of 
one document to the corresponding page of the other document?  Which operators 
do I need to copy if I parse it token by token?  Should I do it directly via 
the PDFStreamParser class, or are there abstractions in PDFBox that will make 
this easier?

The easiest approach might be to:
a) remove the images from the OCR'ed document
b) overlay the pages from the OCR'ed document over the original PDF using 
org.apache.pdfbox.multipdf.Overlay

BR
Maruan

Hello Maruan,

Thanks for your reply. How would I go about removing the images exactly? I 
think this is the line that defines the images in tesseract's PDF renderer:

https://github.com/tesseract-ocr/tesseract/blob/dd8c12997385cf7f5961093bcd44f0396b08f96f/api/pdfrenderer.cpp#L755

Would I be able to access the image objects if I run the PDFStreamParser on the 
contents per page as it’s done in some of the examples, or are they stored 
somewhere else in the PDF file? Which operators mark the beginning and end of 
it?

Timm

The easiest would be to remove the invoke calls from the content stream. You 
can still remove the actual images later (so that your project moves forward). 
Such a call looks somewhat like this:


/Im1 Do

To see this better, use PDFDebugger.
https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/debugger-app/

If possible, upload such a PDF somewhere.

The line you mention is where the object is defined, but the name of the 
object is defined in the resources dictionary. Simply removing it will just 
create a mess.

You can get the token list, and rewrite the tokens with 
ContentStreamWriter.writeTokens.
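
As a rough illustration of the filtering step only, here is a minimal sketch in 
plain Java. It is a simplified model, not PDFBox code: the tokens are plain 
strings rather than the objects PDFStreamParser returns, the class and method 
names (XObjectInvocationFilter, stripDoInvocations) are hypothetical, and the 
sample token values are made up. In a real program the list would come from the 
parser and the result would go to ContentStreamWriter.writeTokens. Note that Do 
also paints form XObjects, so this assumes the page uses it only for the image.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model: each content stream token is represented as a String.
final class XObjectInvocationFilter {

    // Returns a new list without any "<name> Do" pair; the input is not modified.
    static List<String> stripDoInvocations(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if ("Do".equals(token)) {
                // The operand of Do (e.g. /Im1) was added just before it; drop it too.
                if (!kept.isEmpty()) {
                    kept.remove(kept.size() - 1);
                }
                continue;
            }
            kept.add(token);
        }
        return kept;
    }

    public static void main(String[] args) {
        // Illustrative page content: an image invocation followed by a text object.
        List<String> page = Arrays.asList(
            "q", "612", "0", "0", "792", "0", "0", "cm", "/Im1", "Do", "Q",
            "BT", "3", "Tr", "/F1", "12", "Tf", "(Hello)", "Tj", "ET");
        System.out.println(String.join(" ", stripDoInvocations(page)));
    }
}
```

The surrounding q ... cm ... Q operators are left in place here; they are 
harmless without the Do call, and the image object itself stays in the 
resources dictionary until it is removed separately.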


Tilman
