Hi,

Thanks for the quick response. I have uploaded one of the pages at
https://www.dropbox.com/s/7cqlul61pk53gd1/testpage.pdf

Any pointers on how I could extend things would be great.

Thanks,
Stuart

On 12 Jun 2013, at 20:52, Maruan Sahyoun wrote:

> Hi Stuart,
>
> from the screenshot it's not clear how the PDF is laid out. In general there
> are some structures like article threads which PDFBox supports for text
> extraction. PDFBox is also able to handle bookmarks, annotations …, although
> some of this information is not taken into account when using the standard
> ExtractText functionality. But it's possible to extend existing functions.
> With the PDF as a sample it would be easier to understand which PDF feature
> is used for the box and to give you some additional hints. As the mailing
> list doesn't allow PDF attachments, please upload a sample to a public
> location if possible.
>
> BR
> Maruan Sahyoun
>
> On 12.06.2013 at 21:35, Stuart Coleman <[email protected]> wrote:
>
>> Hi,
>>
>> I have a PDF file which I am trying to extract text from. Unfortunately the
>> document is non-sequential and has various boxes with supplementary content.
>> When I open the file in Acrobat Reader, Reader seems to be able to
>> distinguish these features and can surround them with a blue bounding box. I
>> would like to be able to extract text by area from within these bounding
>> boxes. Is PDFBox capable of detecting these features as well?
>>
>> I have attached a screenshot showing the style of box I am referring to (top
>> right-hand corner).
>>
>> Thanks
>> Stuart
>>
>> <Screen Shot 2013-06-12 at 20.17.31.png>

