Hello,

I'm currently working on a text extraction and highlighting pipeline with PDFBox. I was able to adapt the org.apache.pdfbox.ExtractText example to my needs, but I have observed some odd behavior: in some documents the extracted text comes out in the wrong order, yet after splitting the document into single pages with pdftk and extracting the text again, the order is correct. So my question is: what is wrong with the original document? I have quite a lot of documents where this can be observed.

If you would like to reproduce this behavior, have a look at this document (http://www.pubmedcentral.nih.gov/picrender.fcgi?artid=2553346&blobtype=pdf). After running ExtractText on the first page, you can see that the text of the last left column was added to the end of the output. If you then burst the document with pdftk (http://www.accesspdf.com/pdftk/) and run ExtractText again, the extracted text flow is correct. The ExtractText sort option makes the result even worse. I'm using the latest SVN checkout of PDFBox.

Do you have any explanation for this behavior?

Thanks in advance for any help.

Yours sincerely,
Robert Pesch
