Re: detection of column breaks and page breaks in PDF document

Andreas Lehmkühler Thu, 29 May 2025 03:44:37 -0700



On 23.05.25 at 19:19, Tilman Hausherr wrote:

On 23.05.2025 17:01, Robert Rodini wrote:
This question is informational. I use PDFBox utilities to extract text from a large PDF file. The pages of the PDF always use a three-column format. The PDFBox CLI utility is wonderful since it processes the columns from top to bottom and left to right.
Is there a way to use Apache PDFBox to recognize column breaks (start of a new column) and page breaks (start of a new page) as the text is being extracted?
No, but you could use ExtractTextByArea if you know the coordinates.
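
A minimal sketch of the "if you know the coordinates" part, assuming the three columns are equally wide and span the full page height (an assumption, not something stated in this thread). The class and method names are hypothetical, and the page size in main is just US Letter in PDF points. The sketch only computes the rectangles with the JDK; with PDFBox, each one could then be registered through PDFTextStripperByArea.addRegion(name, rect) and read back per column, which is the approach the ExtractTextByArea example demonstrates.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: not part of PDFBox.
class ColumnRegions {

    /**
     * Splits a page into equal-width vertical strips, ordered left to right.
     * Assumes every column has the same width and uses the full page height.
     */
    static List<Rectangle2D> equalColumns(double pageWidth, double pageHeight, int columns) {
        if (columns < 1) {
            throw new IllegalArgumentException("columns must be >= 1");
        }
        double columnWidth = pageWidth / columns;
        List<Rectangle2D> regions = new ArrayList<>();
        for (int i = 0; i < columns; i++) {
            // x offset grows by one column width per strip
            regions.add(new Rectangle2D.Double(i * columnWidth, 0, columnWidth, pageHeight));
        }
        return regions;
    }

    public static void main(String[] args) {
        // 612 x 792 points is US Letter; a real document may differ.
        List<Rectangle2D> regions = equalColumns(612, 792, 3);
        for (int i = 0; i < regions.size(); i++) {
            System.out.println("column" + (i + 1) + " -> " + regions.get(i));
        }
    }
}
```

Extracting each strip separately makes the column boundary explicit: the text returned for one region ends exactly where that column ends. Real layouts usually have margins and gutters, so the rectangles would need adjusting to the actual document.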

Just for the peanut gallery: it is easy to detect page breaks, as PDF docs are organized in pages.
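
One way this could look in practice, as a sketch under assumptions: the extracted text is assumed to carry a form feed at the end of every page, for example by calling PDFTextStripper.setPageEnd("\f") before extraction. The class and method names below are hypothetical and the code uses only the JDK; it shows how the marker turns one long string back into per-page chunks.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: not part of PDFBox.
class PageSplitter {

    // Assumed marker; it must match whatever the text stripper was told to emit.
    static final String PAGE_MARKER = "\f";

    /** Splits extracted text into pages at each marker, in document order. */
    static List<String> splitPages(String text) {
        List<String> pages = new ArrayList<>();
        int start = 0;
        int idx;
        while ((idx = text.indexOf(PAGE_MARKER, start)) >= 0) {
            pages.add(text.substring(start, idx));
            start = idx + PAGE_MARKER.length();
        }
        // Keep trailing text that is not followed by a marker.
        if (start < text.length()) {
            pages.add(text.substring(start));
        }
        return pages;
    }

    public static void main(String[] args) {
        List<String> pages = splitPages("first page\fsecond page\f");
        for (int i = 0; i < pages.size(); i++) {
            System.out.println("page " + (i + 1) + ": " + pages.get(i));
        }
    }
}
```

An alternative that avoids markers altogether is to extract one page at a time by setting the stripper's start and end page to the same number.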

Everything else is more or less complicated/possible, as pointed out by Tilman.


Andreas


Tilman


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



