Re: Detect headers of PDF

Tilman Hausherr Wed, 27 Jul 2016 23:10:37 -0700

On 28.07.2016 at 03:27, Qingchao Kong wrote:

Hi, I want to detect the headers of PDF docs.


In my PDF files, I notice that the headers and the main text body are
usually separated by a horizontal line. Is it possible to detect this
"line" using Java code?


Yes, but this is tricky: PDF has no <HEADER> element. Have a look here:
https://svn.apache.org/viewvc/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/util/RemoveAllText.java?view=markup

That example does something else, but the principle is the same: analyze the content stream.


To understand what the PDF operators do, get the PDF 32000 specification
https://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf
and go to the "Operator Summary" section (Annex A).

If you're lucky, the line is really a line, i.e. operators m (moveto) and l (lineto). If you're not lucky, it is a small image or a thin rectangle (operator re).
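
To illustrate the idea, here is a minimal sketch that scans an already
decoded content stream, given as plain text, for horizontal "m ... l"
segments and thin "re" rectangles. It deliberately uses only the JDK, not
PDFBox. The class HeaderLineFinder, its HLine type and the maxThickness
parameter are hypothetical names made up for this example. It ignores the
transformation matrix (cm), does not check whether the path is actually
painted, and would be confused by operator-like tokens inside text strings,
so treat it as a starting point rather than a robust detector. In real code
you would probably get the tokens from the PDFBox parser instead of
splitting on whitespace.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not part of PDFBox. */
class HeaderLineFinder {

    /** A horizontal segment: its y position and its x extent, in user space units. */
    static final class HLine {
        final double y, xMin, xMax;

        HLine(double y, double xMin, double xMax) {
            this.y = y;
            this.xMin = xMin;
            this.xMax = xMax;
        }
    }

    /**
     * Collects horizontal lines drawn with "m"/"l" and rectangles drawn with
     * "re" whose height is at most maxThickness.
     * Assumption: content is a decoded content stream with tokens separated
     * by whitespace, and no transformation matrix is in effect.
     */
    static List<HLine> find(String content, double maxThickness) {
        List<HLine> result = new ArrayList<>();
        List<Double> operands = new ArrayList<>();
        double curX = 0, curY = 0;
        boolean hasPoint = false;

        for (String token : content.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            if (token.matches("[+-]?(\\d+\\.?\\d*|\\.\\d+)")) {
                operands.add(Double.parseDouble(token)); // numeric operand
                continue;
            }
            int n = operands.size();
            if (token.equals("m") && n >= 2) {
                // moveto: start a new subpath
                curX = operands.get(n - 2);
                curY = operands.get(n - 1);
                hasPoint = true;
            } else if (token.equals("l") && n >= 2) {
                // lineto: horizontal if y is unchanged and x differs
                double x = operands.get(n - 2);
                double y = operands.get(n - 1);
                if (hasPoint && Math.abs(y - curY) <= 1e-3 && x != curX) {
                    result.add(new HLine(y, Math.min(curX, x), Math.max(curX, x)));
                }
                curX = x;
                curY = y;
                hasPoint = true;
            } else if (token.equals("re") && n >= 4) {
                // rectangle: x y width height; a very flat one looks like a line
                double x = operands.get(n - 4);
                double y = operands.get(n - 3);
                double w = operands.get(n - 2);
                double h = operands.get(n - 1);
                if (Math.abs(h) <= maxThickness && Math.abs(w) > Math.abs(h)) {
                    result.add(new HLine(y + h / 2, Math.min(x, x + w), Math.max(x, x + w)));
                }
                curX = x;
                curY = y;
                hasPoint = true;
            }
            operands.clear(); // operands belong to the operator just seen
        }
        return result;
    }
}
```

Calling HeaderLineFinder.find(streamText, 2.0) returns the candidate
segments. The one closest to the top of the page that spans most of the
page width is then a reasonable guess for the header separator.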


Tilman


If this is possible, I can get the rectangle of the main text area
and remove the headers automatically using Java code.


