Hi,

On 02.05.2014 13:18, Qingchao Kong wrote:
> Paul,
> I think I am aware of the difference between
> "stripper.setSortByPosition(true)" and
> "stripper.setSortByPosition(false)". It is best explained by trying
> to extract a PDF that has multiple columns, e.g. two columns.
>
> With "stripper.setSortByPosition(false)", the extraction
> result usually follows the reading order, which is fine. But with
> "stripper.setSortByPosition(true)", PDFBox extracts text from
> top to bottom, ignoring the columns, which is not what I expect.

I'm afraid there is a misunderstanding. PDFBox can't extract text in a context-sensitive way, e.g. by detecting columns, headers or footers.

Just for clarification:

sortByPosition = false

PDFBox extracts the text in the order in which it appears in the PDF. In most cases that order already matches the reading order, but this need not be true for every PDF. Updated PDFs in particular are often no longer in order.

sortByPosition = true

PDFBox extracts the text and tries to sort it using the position of each character. This works fine for simple texts. It gets more complicated, and may give a wrong result, if one of the following is used:

- different text sizes in the same line
- different font sizes in the same line
- super/subscripts
- multiple columns
- ....
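
To illustrate the two modes, here is a minimal toy sketch in plain Java. It does not use PDFBox and is not how PDFBox implements the sort; the class SortByPositionDemo, the Fragment type, and the coordinates are all made up for illustration. It only assumes that "sort by position" roughly means "top to bottom, then left to right". It shows that such a sort repairs a scrambled single-column page but interleaves the lines of a two-column page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

class SortByPositionDemo {

    /** Hypothetical positioned piece of text; not a PDFBox class. */
    static final class Fragment {
        final String text;
        final double x; // distance from the left edge of the page
        final double y; // distance from the TOP of the page (assumption of this sketch)

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Joins fragments in the order given, comparable to sortByPosition = false. */
    static String inStreamOrder(List<Fragment> fragments) {
        return fragments.stream().map(f -> f.text).collect(Collectors.joining(" "));
    }

    /** Sorts top to bottom, then left to right, comparable to sortByPosition = true. */
    static String inPositionOrder(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y)
                .thenComparingDouble(f -> f.x));
        return inStreamOrder(sorted);
    }

    /** Two columns, stored column by column (already in reading order). */
    static List<Fragment> twoColumnPage() {
        return Arrays.asList(
                new Fragment("L1", 50, 100),
                new Fragment("L2", 50, 120),
                new Fragment("R1", 300, 100),
                new Fragment("R2", 300, 120));
    }

    /** One column whose second line was stored first, e.g. after an update. */
    static List<Fragment> scrambledPage() {
        return Arrays.asList(
                new Fragment("world", 50, 120),
                new Fragment("Hello", 50, 100));
    }

    public static void main(String[] args) {
        System.out.println(inStreamOrder(twoColumnPage()));   // L1 L2 R1 R2
        System.out.println(inPositionOrder(twoColumnPage())); // L1 R1 L2 R2
        System.out.println(inStreamOrder(scrambledPage()));   // world Hello
        System.out.println(inPositionOrder(scrambledPage())); // Hello world
    }
}
```

In the two-column case the position sort has no notion of columns, so it emits the first line of both columns before the second line of either one.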

BR
Andreas Lehmkühler
