Re: Is PDFBox capable of detecting features Acrobat Reader can highlight

Maruan Sahyoun Wed, 12 Jun 2013 12:53:54 -0700

Hi Stuart,

from the screenshot it's not clear how the PDF is laid out. In general there 
are some structures, such as article threads, which PDFBox supports for text 
extraction. PDFBox is also able to handle bookmarks, annotations, etc., 
although some of this information is not taken into account when using the 
standard ExtractText functionality. It is possible to extend the existing 
functions, though. With the PDF as a sample it would be easier to understand 
which PDF feature is used for the box and to give you some additional hints. 
As the mailing list doesn't allow PDF attachments, please upload a sample to a 
public location if possible.
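
If the box turns out to be described by a rectangle in the PDF (for example 
the rectangle of an annotation or of an article bead), extracting text by area 
could look roughly like the sketch below. It is only a sketch under 
assumptions: RegionCoordinates and toTopLeftRegion are hypothetical names, not 
part of PDFBox, the numbers are made up, and it assumes an unrotated page whose 
crop box starts at (0, 0). It also assumes that PDFTextStripperByArea expects 
regions with a top-left origin, whereas rectangles stored in the PDF use a 
bottom-left origin, so the y value has to be flipped.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper, not part of PDFBox.
class RegionCoordinates {

    /**
     * Converts a rectangle given in PDF user space (origin at the bottom-left
     * of the page, y growing upwards) into a rectangle with a top-left origin
     * and y growing downwards.
     */
    static Rectangle2D toTopLeftRegion(double lowerLeftX, double lowerLeftY,
                                       double upperRightX, double upperRightY,
                                       double pageHeight) {
        double width = upperRightX - lowerLeftX;
        double height = upperRightY - lowerLeftY;
        // Distance from the top edge of the page to the top edge of the box.
        double top = pageHeight - upperRightY;
        return new Rectangle2D.Double(lowerLeftX, top, width, height);
    }

    public static void main(String[] args) {
        // Made-up box near the top right corner of a 612 x 792 point page.
        Rectangle2D region = toTopLeftRegion(400, 600, 550, 750, 792);
        System.out.println(region);

        // With PDFBox on the classpath, the region could then be used like this
        // (assumption: page is a PDPage taken from the loaded document):
        //   PDFTextStripperByArea stripper = new PDFTextStripperByArea();
        //   stripper.addRegion("box", region);
        //   stripper.extractRegions(page);
        //   String text = stripper.getTextForRegion("box");
    }
}
```

Whether this applies at all depends on which PDF feature actually produces the 
box, which is why a sample file is needed.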


BR
Maruan Sahyoun

On 12.06.2013 at 21:35, Stuart Coleman <[email protected]> wrote:

> Hi,
> 
> I have a PDF file which I am trying to extract text from. Unfortunately the 
> document is non-sequential and has various boxes with supplementary content. 
> When I open the file in Acrobat Reader, Reader seems to be able to 
> distinguish these features and can surround them with a blue bounding box. I 
> would like to be able to extract text by area from within these bounding 
> boxes. Is PDFBox capable of detecting these features as well?
> 
> I have attached a screenshot showing the style of box I am referring to (top 
> right-hand corner).
> 
> Thanks
> Stuart
> 
> <Screen Shot 2013-06-12 at 20.17.31.png>
