On 23.05.25 at 19:19, Tilman Hausherr wrote:
On 23.05.2025 17:01, Robert Rodini wrote:
This question is informational. I use the PDFBox utilities to extract text from a large PDF file. Every page of the PDF uses a three-column layout. The PDFBox CLI utility is wonderful since it processes the columns from top to bottom and left to right.

Is there a way to use Apache PDFBox to recognize column breaks (start of a new column) and page breaks (start of a new page) as the text is being extracted?


No, but you could use ExtractTextByArea if you know the coordinates.
Just for the peanut gallery: it is easy to detect page breaks, as PDF docs are organized in pages.

Everything else is more or less complicated/possible as pointed out by Tilman.
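
A minimal sketch of the "if you know the coordinates" part, using only the JDK. It assumes three equal-width columns that span the full page with no margins or gutters, which is an assumption and not something stated in this thread; real documents will likely need measured coordinates. The class name ColumnRegions and the method split are hypothetical helpers, not part of PDFBox. The resulting rectangles are the kind of value that PDFTextStripperByArea.addRegion(name, rect) accepts; calling extractRegions(page) once per page and then getTextForRegion(name) per column would make both page and column boundaries explicit in your own loop.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits a page into equal-width column rectangles.
// Assumption: columns span the full page, with no margins or gutters.
class ColumnRegions {

    // Returns one rectangle per column, ordered left to right.
    static List<Rectangle2D> split(double pageWidth, double pageHeight, int columns) {
        if (columns < 1) {
            throw new IllegalArgumentException("columns must be >= 1");
        }
        List<Rectangle2D> regions = new ArrayList<>();
        double columnWidth = pageWidth / columns;
        for (int i = 0; i < columns; i++) {
            // x offset grows by one column width per column; y starts at 0
            regions.add(new Rectangle2D.Double(i * columnWidth, 0, columnWidth, pageHeight));
        }
        return regions;
    }

    public static void main(String[] args) {
        // US Letter page size in points (612 x 792), three columns
        for (Rectangle2D region : split(612, 792, 3)) {
            System.out.println(region);
        }
    }
}
```

Each rectangle would be registered under its own region name (for example "col0", "col1", "col2", names chosen here only for illustration), so the text of each column comes back separately instead of as one merged stream.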

Andreas


Tilman


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: users-h...@pdfbox.apache.org


