On 23.05.25 at 19:19, Tilman Hausherr wrote:
On 23.05.2025 17:01, Robert Rodini wrote:
This question is informational. I use the PDFBox utilities to extract
text from a large PDF file. The pages of the PDF always use a
three-column layout. The PDFBox CLI utility is wonderful since it
processes the columns from top to bottom and left to right.
Is there a way to use Apache PDFBox to recognize column breaks (start
of a new column) and page breaks (start of a new page) as the text is
being extracted?
No, but you could use ExtractTextByArea if you know the coordinates.
Just for the peanut gallery: it is easy to detect page breaks, as PDF
docs are organized in pages.
Everything else is more or less complicated/possible as pointed out by
Tilman.
Andreas
Tilman
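
For anyone who wants to try the area-based route, here is a minimal
sketch of the coordinate side only. It assumes symmetric margins and
equal-width columns separated by a fixed gutter; the class name
ColumnRegions, the method split and all numeric values are made up for
illustration and are not part of PDFBox. Real documents will need
measured coordinates.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a PDFBox class: computes one rectangle per column.
public class ColumnRegions {

    /**
     * Splits the area inside the margins into equal-width columns, left to right.
     * All values are in PDF units (1/72 inch). Assumes the same margin on all
     * four sides and the same gutter between neighbouring columns.
     */
    public static List<Rectangle2D> split(double pageWidth, double pageHeight,
                                          double margin, double gutter, int columns) {
        // Width left for text after removing both side margins and all gutters.
        double usableWidth = pageWidth - 2 * margin - gutter * (columns - 1);
        double columnWidth = usableWidth / columns;
        double columnHeight = pageHeight - 2 * margin;

        List<Rectangle2D> regions = new ArrayList<>();
        for (int i = 0; i < columns; i++) {
            // Each column starts one column width plus one gutter after the previous one.
            double x = margin + i * (columnWidth + gutter);
            regions.add(new Rectangle2D.Double(x, margin, columnWidth, columnHeight));
        }
        return regions;
    }
}
```

Each rectangle can then be registered with
PDFTextStripperByArea.addRegion(name, rect); after calling
extractRegions(page) for a page, getTextForRegion(name) returns the text
of that column. Looping over the pages yourself gives you the page
breaks, and moving from one region to the next gives you the column
breaks. Please check the region origin convention against the PDFBox
documentation before relying on the y values.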
---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: users-h...@pdfbox.apache.org
---------------------------------------------------------------------