Re: Extract underlying PDF code from PDF file by selecting an area

Maruan Sahyoun Thu, 15 Jan 2015 00:07:25 -0800

Hi Stefan,

Yes, PDFBox is capable of doing this. To crop the page to the dimensions you 
need, you can use PDPage.setCropBox:

http://pdfbox.apache.org/docs/1.8.8/javadocs/org/apache/pdfbox/pdmodel/PDPage.html#setCropBox(org.apache.pdfbox.pdmodel.common.PDRectangle)

As John pointed out, the SuperimposePage example will give you the basics to 
import and 'mount' the page into a new or existing PDF.

The only remaining step is to get the coordinates from the mouse and translate 
them into the dimensions of the rectangle in PDF coordinates.
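
A minimal sketch of that translation, in plain Java without any PDFBox calls. 
The class name MouseToPdfRect and its parameters are hypothetical. It assumes 
the viewer shows the page unrotated, with the pixel origin at the top-left 
corner of the page, a uniform zoom of pixelsPerPoint, and a page box whose 
lower-left corner is at (0, 0):

```java
/**
 * Sketch: convert a mouse-drag selection (viewer pixels, origin top-left,
 * y growing downwards) into a rectangle in PDF user space (points, origin
 * bottom-left, y growing upwards).
 *
 * Assumptions (illustrative, not part of PDFBox): unrotated page, uniform
 * zoom, page box starting at (0, 0).
 */
final class MouseToPdfRect {

    private MouseToPdfRect() {
    }

    /**
     * @param x1 pixel x of the point where the drag started
     * @param y1 pixel y of the point where the drag started
     * @param x2 pixel x of the point where the drag ended
     * @param y2 pixel y of the point where the drag ended
     * @param pageHeightPt   page height in points
     * @param pixelsPerPoint zoom factor of the viewer (pixels per PDF point)
     * @return {lowerLeftX, lowerLeftY, upperRightX, upperRightY} in points
     */
    static double[] toPdfRect(double x1, double y1, double x2, double y2,
                              double pageHeightPt, double pixelsPerPoint) {
        if (pixelsPerPoint <= 0) {
            throw new IllegalArgumentException("pixelsPerPoint must be positive");
        }
        // Normalise the corners so the drag direction does not matter,
        // and convert from pixels to points.
        double left = Math.min(x1, x2) / pixelsPerPoint;
        double right = Math.max(x1, x2) / pixelsPerPoint;
        double top = Math.min(y1, y2) / pixelsPerPoint;    // distance from top edge
        double bottom = Math.max(y1, y2) / pixelsPerPoint; // distance from top edge

        // Flip the y axis: PDF measures y from the bottom of the page.
        return new double[] {
            left, pageHeightPt - bottom, right, pageHeightPt - top
        };
    }
}
```

For a page 792 points tall shown at 2 pixels per point, a drag from (200, 100) 
to (600, 500) gives the corners (100, 542) and (300, 742). Those four values 
could then be used to build the PDRectangle passed to PDPage.setCropBox.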

BR
Maruan

On 15.01.2015 at 08:48, Stefan Falk <[email protected]> wrote:

> Hi John!
> 
> Yes, clipping the PDF is basically what I would like to do! So would pdfbox 
> be the best choice for this? I have looked a lot for a library, but it does 
> not seem that there are many open source tools out there.
> 
> My goal is a program that allows clipping PDFs in order to create a composed 
> PDF out of all the clips. Maybe you could tell me whether pdfbox would be the 
> best choice for such a task.
> 
> @fairly difficult: Well yes, I was quite astonished to find out that 
> extracting content from a PDF is actually a scientific topic :D
> 
> Best regards,
> Stefan
> 
> On 2015-01-15 03:21, John Hewson wrote:
>> Hi Stefan
>> 
>> What you’re describing is actually fairly difficult due to the complexity of 
>> the PDF operators. We have a special processor for text in PDFBox, but it is 
>> not necessarily accurate.
>> 
>> If you’re just trying to embed pages from existing PDFs into new PDFs then 
>> the SuperimposePage example which comes with PDFBox might already serve your 
>> needs. If you specify a custom BBox for the FormXObject, then you can use 
>> that to clip the page - which sounds like what you want. Please note that 
>> this technique still embeds all of the original page contents, so it’s not 
>> suitable for removing private or sensitive data, but otherwise it’s fine.
>> 
>> If you have PDFs which PDFReader can’t render, please try using the 2.0 
>> trunk version of PDFBox, where we have fixed many bugs.
>> 
>> Thanks
>> 
>> -- John
>> 
>>> On 14 Jan 2015, at 15:14, Stefan Falk <[email protected]> wrote:
>>> 
>>> Well, basically just extract it to load it into another PDF, but it should 
>>> be possible to select it, e.g. with the mouse.
>>> 
>>> 
>>> On 2015-01-14 22:52, Maruan Sahyoun wrote:
>>>> what would you like to do with that content?
>>>> 
>>>> BR
>>>> Maruan
>>>> 
>>>> On 14.01.2015 at 21:42, Stefan Falk <[email protected]> wrote:
>>>> 
>>>>> Hello pdfbox people!
>>>>> 
>>>>> I was wondering if anybody can help me with my needs. What I am looking 
>>>>> for is a possibility to extract the underlying PDF code from a PDF file 
>>>>> by simply selecting an area with your mouse.
>>>>> 
>>>>> After reading a few things about PDFs I have learned that anything that 
>>>>> has to do with extracting content from a PDF can be quite a hard task.
>>>>> 
>>>>> So I was wondering if pdfbox could do that somehow. I've taken a rough 
>>>>> look at the PDFReader and I noticed that there is e.g. 
>>>>> processTextPosition in the class PageDrawer, which seems to allow me to 
>>>>> get at least the position of text - am I right in assuming that?
>>>>> 
>>>>> My concrete question would be: what is possible with pdfbox regarding 
>>>>> this matter? E.g. I have a PDF on my drive whose text seems to be 
>>>>> "extractable" by pdfbox on the one hand, but on the other hand the 
>>>>> PDFReader is not able to render any of it. It just renders the images 
>>>>> (see attachment).
>>>>> 
>>>>> Thank you for your help in advance!
>>>>> 
>>>>> Best regards,
>>>>> Stefan
>> 
>
