Re: Extract underlying PDF code from PDF file by selecting an area

Stefan Falk Thu, 15 Jan 2015 00:19:36 -0800

This is awesome! Thank you!

I will take a close look at it and update to the trunk version too.

Do you want me to report PDFs that cannot be displayed correctly in the future?


Best regards,
Stefan

On 2015-01-15 09:03, Maruan Sahyoun wrote:

Hi Stefan,

yes, PDFBox is capable of doing this. To crop the page to the dimensions you 
need you can use

PDPage.setCropBox 
[http://pdfbox.apache.org/docs/1.8.8/javadocs/org/apache/pdfbox/pdmodel/PDPage.html#setCropBox(org.apache.pdfbox.pdmodel.common.PDRectangle)]
As John pointed out, the SuperimposePage example will give you the basics to 
import and 'mount' the page into a new or existing PDF.

The only remaining step is to get the coordinates from the mouse and translate 
them into the dimensions of the rectangle in PDF coordinates.
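
A minimal sketch of that translation in plain Java, with no PDFBox calls. It 
assumes an unrotated page whose lower-left corner is at (0, 0), drawn from the 
top-left corner of the view at a uniform zoom. The class name, method name and 
the pixelsPerPoint parameter are made up for illustration only.

```java
import java.util.Arrays;

final class SelectionToPdfRect {

    /**
     * Converts a mouse drag given in view pixels (origin top-left, y grows
     * downwards) into a PDF-style rectangle {llx, lly, urx, ury} in points
     * (origin bottom-left, y grows upwards).
     *
     * Assumptions (hypothetical setup): the page is not rotated, its
     * lower-left corner is (0, 0), and it is rendered at pixelsPerPoint.
     */
    static double[] toPdfRect(double pressX, double pressY,
                              double releaseX, double releaseY,
                              double pixelsPerPoint,
                              double pageWidthPt, double pageHeightPt) {
        // The user may drag in any direction, so normalize the corners first.
        double leftPx = Math.min(pressX, releaseX);
        double rightPx = Math.max(pressX, releaseX);
        double topPx = Math.min(pressY, releaseY);
        double bottomPx = Math.max(pressY, releaseY);

        // Scale pixels to points; x keeps its direction.
        double llx = leftPx / pixelsPerPoint;
        double urx = rightPx / pixelsPerPoint;

        // Flip y: the bottom edge on screen becomes the lower edge in PDF space.
        double lly = pageHeightPt - bottomPx / pixelsPerPoint;
        double ury = pageHeightPt - topPx / pixelsPerPoint;

        // Keep the rectangle inside the page.
        return new double[] {
            clamp(llx, 0, pageWidthPt), clamp(lly, 0, pageHeightPt),
            clamp(urx, 0, pageWidthPt), clamp(ury, 0, pageHeightPt)
        };
    }

    private static double clamp(double value, double low, double high) {
        return Math.max(low, Math.min(high, value));
    }

    public static void main(String[] args) {
        // A 612 x 792 pt page shown at 2 pixels per point.
        double[] rect = toPdfRect(200, 100, 600, 500, 2.0, 612, 792);
        System.out.println(Arrays.toString(rect));
    }
}
```

The four values are in the order a PDF rectangle uses, so they could be used 
to build the PDRectangle that is passed to PDPage.setCropBox. A rotated page 
or a page whose lower-left corner is not (0, 0) would need an extra step.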

BR
Maruan

On 15.01.2015 at 08:48, Stefan Falk <[email protected]> wrote:

Hi John!

Yes, clipping the PDF is basically what I would like to do! So would pdfbox be 
the best choice for this? I have looked a lot for a library, but there do not 
seem to be many open source tools out there.

My goal is a program that lets me clip PDFs and create a composed PDF out of 
all the clips. Maybe you could tell me whether pdfbox would be the best choice 
for such a task.

@fairly difficult: Well yes, I was quite astonished to find out that extracting 
content from a PDF is actually a scientific topic :D

Best regards,
Stefan

On 2015-01-15 03:21, John Hewson wrote:

Hi Stefan

What you’re describing is actually fairly difficult due to the complexity of 
the PDF operators. We have a special processor for text in PDFBox, but it is 
not necessarily accurate.

If you’re just trying to embed pages from existing PDFs into new PDFs then the 
SuperimposePage example which comes with PDFBox might already serve your needs. 
If you specify a custom BBox for the FormXObject, then you can use that to clip 
the page - which sounds like what you want. Please note that this technique 
still embeds all of the original page contents, so it’s not suitable for 
removing private or sensitive data, but otherwise it’s fine.

If you have PDFs which PDFReader can’t render, please try using the 2.0 trunk 
version of PDFBox, where we have fixed many bugs.

Thanks

-- John

On 14 Jan 2015, at 15:14, Stefan Falk <[email protected]> wrote:

Well, basically just extract it in order to load it into another PDF, but 
selecting it should be possible, e.g., with the mouse.


On 2015-01-14 22:52, Maruan Sahyoun wrote:

What would you like to do with that content?

BR
Maruan

On 14.01.2015 at 21:42, Stefan Falk <[email protected]> wrote:

Hello pdfbox people!

I was wondering if anybody can help me with my needs. What I am looking for is 
a possibility to extract the underlying PDF code from a PDF file by simply 
selecting an area with your mouse.

After reading a few things about PDFs I have learned that anything that has to 
do with extracting content from a PDF can be quite a hard task.

So I was wondering if pdfbox could do that somehow. I've taken a rough look at 
the PDFReader and I noticed that there is e.g. processTextPosition in the 
class PageDrawer, which seems to allow me to get at least the position of 
text. Am I right in assuming that?

My concrete question is: what is possible with pdfbox regarding this matter? 
E.g. I have a PDF on my drive whose text seems to be "extractable" by pdfbox, 
but the PDFReader is not able to render any of it. It just renders the images 
(see attachment).

Thank you for your help in advance!

Best regards,
Stefan
