Re: help w/ text extraction

Tilman Hausherr Sat, 18 Apr 2026 08:54:19 -0700

Hi,

users mailing list is fine. Please share your file by uploading it to a sharehoster.

Yes, you can use the ExtractTextByArea class, or alternatively set "beads" (kind of invisible rectangles) on top of the pages and then extract normally; however, both approaches require you to know where the columns are.
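
For the area approach, the column rectangles have to come from somewhere. Below is a minimal sketch that only computes them, assuming three equal-width columns on a 612 x 792 pt (US Letter) page with 36 pt margins. All of those numbers and the helper name ColumnRegions are assumptions for illustration; the real values have to be measured from the file. With PDFBox on the classpath, each rectangle could then be handed to PDFTextStripperByArea.addRegion(name, rect), followed by extractRegions(page) and getTextForRegion(name).

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: computes one rectangle per column.
final class ColumnRegions {

    /**
     * Splits the area inside the margins into equal-width rectangles,
     * ordered left to right. Units are points; the origin is assumed to be
     * the top-left corner of the page.
     */
    static List<Rectangle2D> equalColumns(double pageWidth, double pageHeight,
                                          int columns, double margin) {
        if (columns < 1 || margin < 0
                || pageWidth <= 2 * margin || pageHeight <= 2 * margin) {
            throw new IllegalArgumentException("invalid page geometry");
        }
        double columnWidth = (pageWidth - 2 * margin) / columns;
        double columnHeight = pageHeight - 2 * margin;
        List<Rectangle2D> regions = new ArrayList<>();
        for (int i = 0; i < columns; i++) {
            regions.add(new Rectangle2D.Double(
                    margin + i * columnWidth, margin, columnWidth, columnHeight));
        }
        return regions;
    }

    public static void main(String[] args) {
        // Assumed geometry: US Letter, 36 pt margins, three columns.
        List<Rectangle2D> regions = equalColumns(612, 792, 3, 36);
        for (int i = 0; i < regions.size(); i++) {
            // With PDFBox available, this is where one would call
            //   stripper.addRegion("col" + i, regions.get(i));
            System.out.println("col" + i + " = " + regions.get(i));
        }
    }
}
```

Reading the regions back in index order gives the text column by column. If the columns are not equal in width, the rectangles need to be set by hand instead.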


Tilman

Am 17.04.2026 um 22:53 schrieb Robert Rodini:

I have used CLI PDFBox utility successfully for years on a third party PDF 
which is issued twice a year. The PDF is always produced with a 3 column format 
and the extracted text always comes out column by column from top to bottom 
from each column.  That is until now.

Now the 3rd party changed the internals of the PDF such that PDFBox extracts 
the text in a somewhat unpredictable order. It seems to work left to right 
horizontally multiple times.  The extracted text is no longer in the expected 
order.

Can you steer me to the PDFBox APIs that might help me understand the new 
internal structure? My initial goal is to write a Java program that can 
distinguish the old PDF files from the new PDF files.  Later, to write my own 
extraction program.

Thank you

P.S. Should this question be submitted [email protected]?
