Re: help w/ text extraction

Gilad Denneboom Fri, 17 Apr 2026 14:14:33 -0700

You can use the ExtractTextByArea util class to extract each column on its
own (assuming their dimensions are always the same), and then concatenate
those strings together, to get the page's text in the correct order.
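
A rough sketch of the idea, in case it helps. As far as I know the
ExtractTextByArea example is built on PDFTextStripperByArea, which takes
named rectangular regions. The code below only computes the column
rectangles and joins the per-column text; the class name ColumnRegions, the
equal-width split and the 612 x 792 page size are my assumptions, so adjust
them to the real column boundaries in your file.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox.
final class ColumnRegions {

    // Splits the page into `count` equal-width vertical strips, left to right.
    // Assumption: columns are equally wide and span the full page height.
    static List<Rectangle2D.Float> columns(float pageWidth, float pageHeight, int count) {
        if (count <= 0) {
            throw new IllegalArgumentException("count must be positive");
        }
        float width = pageWidth / count;
        List<Rectangle2D.Float> regions = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            regions.add(new Rectangle2D.Float(i * width, 0f, width, pageHeight));
        }
        return regions;
    }

    // Concatenates the per-column text in the order given, making sure each
    // column ends with a newline so columns do not run into each other.
    static String joinColumns(List<String> columnTexts) {
        StringBuilder sb = new StringBuilder();
        for (String text : columnTexts) {
            sb.append(text);
            if (!text.endsWith("\n")) {
                sb.append('\n');
            }
        }
        return sb.toString();
    }

    // With PDFBox on the classpath, each rectangle would be used roughly so:
    //   PDFTextStripperByArea stripper = new PDFTextStripperByArea();
    //   stripper.addRegion("col" + i, regions.get(i));
    //   stripper.extractRegions(page);                 // page is a PDPage
    //   String text = stripper.getTextForRegion("col" + i);
}
```

You would call columns(...) once per page with the page's media box width
and height, collect the three region texts, and pass them to joinColumns.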


On Fri, Apr 17, 2026 at 10:54 PM Robert Rodini <[email protected]> wrote:

> I have used the PDFBox CLI utility successfully for years on a third-party
> PDF which is issued twice a year. The PDF is always produced in a 3-column
> format, and the extracted text always came out column by column, top to
> bottom within each column. That is, until now.
>
> Now the 3rd party has changed the internals of the PDF such that PDFBox
> extracts the text in a somewhat unpredictable order. It seems to work left
> to right horizontally multiple times. The extracted text is no longer in
> the expected order.
>
> Can you steer me to the PDFBox APIs that might help me understand the new
> internal structure? My initial goal is to write a Java program that can
> distinguish the old PDF files from the new PDF files.  Later, to write my
> own extraction program.
>
> Thank you
>
> P.S. Should this question be submitted to [email protected]?
>
