Hello,
I'm currently working on a text extraction and highlighting pipeline
with PDFBox. I was able to adapt the org.apache.pdfbox.ExtractText
example to my needs. But I have observed some weird behavior: in some
documents, the text order is wrong, but after splitting the document
into pages with pdftk and extracting the text again, the text order is
right. Now the question is: what is wrong with the document? I have
quite a lot of documents where this can be observed.
If you would like to reproduce this behavior, have a look at this
document
(http://www.pubmedcentral.nih.gov/picrender.fcgi?artid=2553346&blobtype=pdf).
After running ExtractText on the first page, you can see that the text
of the last left column was added to the end of the output. Now if you
burst the document with pdftk (http://www.accesspdf.com/pdftk/) and run
ExtractText again, the extracted text flow is correct.
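
To pin down where the two outputs diverge, a small helper along the
following lines could be used. This is only a sketch under stated
assumptions: the class name ExtractionDiff and the two input files are
hypothetical, it uses only the Java standard library (no PDFBox calls),
and it assumes both ExtractText outputs were saved as UTF-8 text.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper, not part of PDFBox: compares two saved
// text-extraction outputs line by line.
class ExtractionDiff {

    /**
     * Returns the zero-based index of the first line at which the two
     * outputs differ (ignoring leading/trailing whitespace), or -1 if
     * they are identical. If one output is a prefix of the other, the
     * index of the first extra line is returned.
     */
    static int firstDifference(List<String> a, List<String> b) {
        int common = Math.min(a.size(), b.size());
        for (int i = 0; i < common; i++) {
            if (!a.get(i).trim().equals(b.get(i).trim())) {
                return i;
            }
        }
        return a.size() == b.size() ? -1 : common;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java ExtractionDiff <full-doc-page1.txt> <burst-page1.txt>");
            return;
        }
        // readAllLines decodes as UTF-8; a different encoding would need
        // the overload that takes a Charset.
        List<String> whole = Files.readAllLines(Paths.get(args[0]));
        List<String> burst = Files.readAllLines(Paths.get(args[1]));

        int idx = firstDifference(whole, burst);
        if (idx < 0) {
            System.out.println("Outputs are identical.");
        } else {
            System.out.println("First difference at line " + (idx + 1));
            if (idx < whole.size()) {
                System.out.println("  whole document: " + whole.get(idx));
            }
            if (idx < burst.size()) {
                System.out.println("  burst page:     " + burst.get(idx));
            }
        }
    }
}
```

The first argument would be the text extracted from page 1 of the
original document, the second the text extracted from the page produced
by pdftk.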
The ExtractText sort option makes the result even worse. I'm using
the latest SVN checkout of PDFBox.
Do you have any explanation for this behavior?
Thanks in advance for any help.
Yours sincerely,
Robert Pesch