[ https://issues.apache.org/jira/browse/PDFBOX-2463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194880#comment-14194880 ]
Andreas Lehmkühler commented on PDFBOX-2463:
--------------------------------------------

Which area did you use? Extracting the whole text works like a charm.

> ExtractTextByArea mangling second half of this string - transposed, skipped, etc
> --------------------------------------------------------------------------------
>
>                 Key: PDFBOX-2463
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2463
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.7
>            Reporter: Joel Hirsh
>         Attachments: mangled_text .pdf
>
>
> PDF snippet is being completely mangled by ExtractTextByArea. Have a large
> PDF file where this is happening on every line.
> Visually (and Acrobat) show the text:
> 12 Jun EP COPY WORKS LIMITED 503646200256 5637 3.70 11,252.49 OD
> However ExtractTextByArea comes up with:
> 12 Jun EP COPY WORKS LIMITED 503646200256 35 .6 70
> 11,
> 3 257 2.49
> OD
> So the first half of the string is ok, but starting at '5637' characters are
> skipped, other characters are inserted, completely mangled.
> FWIW I did dump the COSString's in PDFStreamEngine and the strings all show
> correctly, nothing unusual.
> Test file to be attached.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)