[ https://issues.apache.org/jira/browse/PDFBOX-5857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17870253#comment-17870253 ]
arjunce commented on PDFBOX-5857:
---------------------------------

Hey Tilman,

Thanks for the reply. For now I abort the processing of documents like this one and eventually OCR them to get the correct data:
{code:java}
// Collect the adjusted baseline (Y) coordinate of every text position.
Set<Float> baseline = textPositions.stream()
        .map(TextPosition::getYDirAdj)
        .collect(Collectors.toSet());
// Collections.max/min throw NoSuchElementException on an empty set, so guard first.
if (!baseline.isEmpty() && Collections.max(baseline) - Collections.min(baseline) > 10)
    throw new RuntimeException("Font size not consistent. Cannot be processed.");
{code}
I don't want to dig too deeply into the document, so it would be great if you could suggest a better way to check this.

> PDFTextStripper returns messed up data
> ---------------------------------------
>
>                 Key: PDFBOX-5857
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5857
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 3.0.2 PDFBox
>            Reporter: arjunce
>            Priority: Minor
>         Attachments: extractedText.txt, jumbledtext.pdf, screenshot-1.png
>
>
> I have attached below the input PDF and its text output for you to take a
> look at. I am using PDFTextStripper along with these:
> {code:java}
> super();
> this.setSortByPosition(true);
> this.setWordSeparator("_word_"); {code}
> Since I am using sort by position, the text is jumbled. Is there a way for me
> to detect this instead of outputting the jumbled text? Any help is
> appreciated, Thanks.
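
For illustration only, the baseline-spread check from the comment above can be sketched without any PDFBox types, so that it can be tried on plain numbers. The class name `BaselineCheck`, its method names, and the `MAX_SPREAD` constant are hypothetical and not part of PDFBox. The threshold of 10 is simply the value used in the comment, and in real use the input would be the `TextPosition::getYDirAdj` values. This is a rough heuristic under those assumptions, not a recommended detection method.

```java
import java.util.Collection;
import java.util.List;

// Hypothetical helper, not part of PDFBox: works on plain baseline values.
final class BaselineCheck {
    // Assumed threshold, copied from the value 10 used in the comment above.
    static final float MAX_SPREAD = 10f;

    // Returns max - min of the given baselines, or 0 for an empty collection.
    static float spread(Collection<Float> baselines) {
        float min = Float.POSITIVE_INFINITY;
        float max = Float.NEGATIVE_INFINITY;
        for (float y : baselines) {
            min = Math.min(min, y);
            max = Math.max(max, y);
        }
        return baselines.isEmpty() ? 0f : max - min;
    }

    // True when all baselines lie within MAX_SPREAD of each other.
    static boolean isConsistent(Collection<Float> baselines) {
        return spread(baselines) <= MAX_SPREAD;
    }

    public static void main(String[] args) {
        System.out.println(isConsistent(List.of(700.0f, 700.5f, 701.0f))); // true
        System.out.println(isConsistent(List.of(700.0f, 650.0f)));         // false
    }
}
```

Returning a boolean instead of throwing leaves the decision (abort, fall back to OCR, log and continue) to the caller, and the empty-input case yields a spread of 0 rather than an exception.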