[ https://issues.apache.org/jira/browse/PDFBOX-5857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17870249#comment-17870249 ]
Tilman Hausherr commented on PDFBOX-5857:
-----------------------------------------

There is a solution in PDFBOX-3970 but it requires a change in PDFBox itself, and it isn't perfect, i.e. some files will look worse.

> PDFTextStripper returns messed up data
> ---------------------------------------
>
> Key: PDFBOX-5857
> URL: https://issues.apache.org/jira/browse/PDFBOX-5857
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 3.0.2 PDFBox
> Reporter: arjunce
> Priority: Minor
> Attachments: extractedText.txt, jumbledtext.pdf, screenshot-1.png
>
>
> I have attached below the input pdf and its text output for you to take a
> look at. I am using PDFTextStripper along with these:
> {code:java}
> super();
> this.setSortByPosition(true);
> this.setWordSeparator("_word_");
> {code}
> Since I am using sort by position the text is jumbled. Is there a way for me
> to detect this instead of outputting the jumbled text? Any help is
> appreciated, Thanks.