Re: How to define regions in PDFTextStripperByArea?

Andreas Lehmkuehler Sun, 04 May 2014 04:16:17 -0700

Hi,

On 02.05.2014 13:18, Qingchao Kong wrote:

Paul,
I think I am aware of the difference between
"stripper.setSortByPosition(true)" and
"stripper.setSortByPosition(false)". It is best seen when you try
to extract a PDF that has multiple columns, e.g. two columns.


When you have "stripper.setSortByPosition(false)", the extraction
result usually follows the reading order, which is fine. But when you
have "stripper.setSortByPosition(true)", PDFBox will extract text from
top to bottom, ignoring the columns, which is not what I expect.

I'm afraid there is a misunderstanding. PDFBox can't extract text context-sensitively, e.g. by detecting columns, headers or footers.


Just for clarification:

sortByPosition = false

PDFBox extracts the text in the order in which it appears in the PDF. In most cases the text will be sorted by default, but that need not be true for every PDF. In particular, updated PDFs are no longer sorted.


sortByPosition = true

PDFBox extracts the text and tries to sort it using the position of each character. This works fine for simple texts. It gets more complicated and may lead to a wrong result if one of the following is used:


- different text sizes in the same line
- different font sizes in the same line
- super/subscripts
- multicolumns
- ....
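
To illustrate the multicolumn case, here is a minimal toy model in plain Java. It does not use PDFBox at all: the class PositionSortDemo, the Fragment type and the coordinates are made up for this example, and the comparator (top to bottom, then left to right) is only a rough simplification of what a position-based sort does. It assumes a two-column page whose text is stored column by column.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Toy model only: none of these types are part of PDFBox.
class PositionSortDemo {

    // A piece of text with the position of its top-left corner (y grows downwards).
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Hypothetical two-column page, stored column by column (left column first).
    static List<Fragment> twoColumnPage() {
        List<Fragment> page = new ArrayList<>();
        page.add(new Fragment("L1", 50, 100));
        page.add(new Fragment("L2", 50, 120));
        page.add(new Fragment("R1", 300, 100));
        page.add(new Fragment("R2", 300, 120));
        return page;
    }

    // Corresponds to sortByPosition = false: keep the order of appearance.
    static String inStreamOrder(List<Fragment> fragments) {
        return join(fragments);
    }

    // Simplified stand-in for sortByPosition = true:
    // sort top to bottom, then left to right, with no knowledge of columns.
    static String sortedByPosition(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator
                .comparingDouble((Fragment f) -> f.y)
                .thenComparingDouble(f -> f.x));
        return join(sorted);
    }

    // Concatenates the fragment texts, separated by single spaces.
    private static String join(List<Fragment> fragments) {
        StringBuilder sb = new StringBuilder();
        for (Fragment f : fragments) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(f.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Fragment> page = twoColumnPage();
        System.out.println(inStreamOrder(page));    // prints "L1 L2 R1 R2"
        System.out.println(sortedByPosition(page)); // prints "L1 R1 L2 R2"
    }
}
```

In this sketch the stream order keeps each column together, while the position sort interleaves the two columns line by line, which matches the behaviour described in the quoted message.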

BR
Andreas Lehmkühler
