That link goes to the homepage of that company, not to the file.
Tilman
On 18.04.2026 at 20:27, Robert Rodini wrote:
Here is my problem PDF file***: https://limewire.com/?referrer=pq7i8xx7p2
***Actually, the 3rd party PDF is a concatenation of about 230 pages like this,
and each of them follows the same pattern.
I don't think this is relevant, but I am mentioning it just in case.
Thanks for your consideration.
Bob
________________________________
From: Tilman Hausherr <[email protected]>
Sent: Saturday, April 18, 2026 11:53 AM
To: [email protected] <[email protected]>
Subject: Re: help w/ text extraction
Hi,
The users mailing list is fine. Please share your file by uploading it to a
file-sharing host.
Yes, you can use the ExtractTextByArea class, or alternatively set
"beads" (a kind of invisible rectangle) on top of the pages and then
extract normally. However, both approaches require you to know where the
columns are.
Tilman
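To illustrate the "know where the columns are" requirement, here is a minimal
sketch that computes one rectangle per column. It assumes equal-width columns,
a uniform page margin and a fixed gutter, none of which is confirmed for the
PDF in question; the class name ColumnRegions and all numbers are hypothetical.
In PDFBox, each rectangle could then be registered with
PDFTextStripperByArea.addRegion(name, rect) and read back with
getTextForRegion(name) after calling extractRegions(page).

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits a page into equal-width column rectangles.
final class ColumnRegions {

    // margin and gutter are assumptions about the layout, in PDF points.
    static List<Rectangle2D> split(double pageWidth, double pageHeight,
                                   double margin, double gutter, int columns) {
        if (columns < 1) {
            throw new IllegalArgumentException("columns must be >= 1");
        }
        double usable = pageWidth - 2 * margin - (columns - 1) * gutter;
        if (usable <= 0) {
            throw new IllegalArgumentException("margins and gutters exceed page width");
        }
        double colWidth = usable / columns;
        List<Rectangle2D> regions = new ArrayList<>();
        for (int i = 0; i < columns; i++) {
            // Left edge of column i, counted from the left page border.
            double x = margin + i * (colWidth + gutter);
            regions.add(new Rectangle2D.Double(x, margin, colWidth,
                                               pageHeight - 2 * margin));
        }
        return regions;
    }

    public static void main(String[] args) {
        // US Letter page (612 x 792 points), three columns.
        for (Rectangle2D r : split(612, 792, 36, 12, 3)) {
            System.out.println(r);
        }
    }
}
```

Extracting the regions in list order would give the text column by column,
provided the assumed geometry matches the real page.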
On 17.04.2026 at 22:53, Robert Rodini wrote:
I have used the PDFBox CLI utility successfully for years on a third party PDF
which is issued twice a year. The PDF is always produced in a 3 column format
and the extracted text always came out column by column, from top to bottom
within each column. That is, until now.
Now the 3rd party has changed the internals of the PDF such that PDFBox extracts
the text in a somewhat unpredictable order. It seems to work left to right
horizontally multiple times. The extracted text is no longer in the expected
order.
Can you steer me to the PDFBox APIs that might help me understand the new
internal structure? My initial goal is to write a Java program that can
distinguish the old PDF files from the new PDF files. Later, I want to write
my own extraction program.
Thank you
P.S. Should this question be submitted to [email protected]?
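
One possible way to tell the old files from the new ones is sketched below.
It is only a heuristic under stated assumptions: the x values are the starting
x-coordinates of text chunks in the order they appear in the page content
(in PDFBox these could be collected by subclassing PDFTextStripper and
overriding writeString(String, List<TextPosition>) without position sorting),
and the columns are assumed to be of equal width. The class name and the
threshold are hypothetical.

```java
// Hypothetical heuristic: counts how often consecutive text chunks
// change column. Column-by-column output switches rarely; row-by-row
// output switches on almost every chunk.
final class ReadingOrderHeuristic {

    // Maps an x-coordinate to a column index, assuming equal-width columns.
    static int columnOf(double x, double pageWidth, int columns) {
        int c = (int) (x / (pageWidth / columns));
        return Math.max(0, Math.min(columns - 1, c));
    }

    static int countColumnSwitches(double[] xs, double pageWidth, int columns) {
        int switches = 0;
        for (int i = 1; i < xs.length; i++) {
            if (columnOf(xs[i], pageWidth, columns)
                    != columnOf(xs[i - 1], pageWidth, columns)) {
                switches++;
            }
        }
        return switches;
    }

    // Assumed threshold: a strictly column-by-column page needs at most
    // (columns - 1) switches. Real pages with full-width headers would
    // need some tolerance on top of this.
    static boolean looksColumnByColumn(double[] xs, double pageWidth, int columns) {
        return countColumnSwitches(xs, pageWidth, columns) <= columns - 1;
    }

    public static void main(String[] args) {
        double[] oldStyle = {40, 40, 40, 230, 230, 420, 420};
        double[] newStyle = {40, 230, 420, 40, 230, 420};
        System.out.println(looksColumnByColumn(oldStyle, 612, 3));
        System.out.println(looksColumnByColumn(newStyle, 612, 3));
    }
}
```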