Text extraction and text flow

Robert Pesch Mon, 08 Jun 2009 05:35:51 -0700

Hello,

I'm currently working on a text extraction and highlighting pipeline with PDFBox. I was able to adapt the org.apache.pdfbox.ExtractText example to my needs, but I have observed some odd behavior: in some documents the extracted text order is corrupt, yet after splitting the document into pages with pdftk and extracting the text again, the text order is correct. So the question is: what is wrong with the document? I have quite a lot of documents where this can be observed.

If you would like to reproduce this behavior, have a look at this document (http://www.pubmedcentral.nih.gov/picrender.fcgi?artid=2553346&blobtype=pdf). After running ExtractText on the first page, you can see that the text of the last left column was added to the end of the output. If you then burst the document with pdftk (http://www.accesspdf.com/pdftk/) and run ExtractText again, the extracted text flow is correct. The ExtractText sort option makes the result even worse. I'm using the latest SVN checkout of PDFBox.
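
As a purely illustrative sketch of why a position-based sort could make things worse on a multi-column page: the toy program below does not use PDFBox at all. The Fragment class, the coordinates, and the sorting rule (top to bottom, then left to right) are assumptions made up for the example, not PDFBox's actual implementation, and it does not explain the misplaced column itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class ColumnSortSketch {

    // Hypothetical stand-in for a positioned piece of text; not a PDFBox class.
    static final class Fragment {
        final String text;
        final int x; // distance from the left page edge
        final int y; // distance from the top page edge (grows downward)

        Fragment(String text, int x, int y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Joins fragment texts in list order, separated by single spaces.
    static String join(List<Fragment> fragments) {
        StringBuilder sb = new StringBuilder();
        for (Fragment f : fragments) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(f.text);
        }
        return sb.toString();
    }

    // Naive position sort: top to bottom first, then left to right.
    static List<Fragment> sortByPosition(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingInt((Fragment f) -> f.y)
                .thenComparingInt(f -> f.x));
        return sorted;
    }

    // Two-column page whose stream order is already the reading order:
    // the whole left column first, then the whole right column.
    static List<Fragment> samplePage() {
        return Arrays.asList(
                new Fragment("L1", 50, 100),
                new Fragment("L2", 50, 120),
                new Fragment("R1", 300, 100),
                new Fragment("R2", 300, 120));
    }

    public static void main(String[] args) {
        List<Fragment> page = samplePage();
        System.out.println("stream order: " + join(page));
        System.out.println("sorted order: " + join(sortByPosition(page)));
    }
}
```

In this toy layout the stream order keeps each column together, while the naive sort interleaves lines from the left and right columns because they share the same vertical positions.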


Do you have any explanation for this behavior?

Thanks in advance for any help.

Yours Sincerely,
Robert Pesch
