[ https://issues.apache.org/jira/browse/PDFBOX-5828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
arjunce updated PDFBOX-5828:
----------------------------
    Labels:   (was: easyfix)

> PDFTextStripper created garbled text
> ------------------------------------
>
>                 Key: PDFBOX-5828
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5828
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 3.0.1 PDFBox, 3.0.2 PDFBox
>            Reporter: arjunce
>            Priority: Major
>         Attachments: output_text_stripper.txt, screenshot-1.png, test.pdf
>
>
> Hello Folks,
> I am using PDFBox to extract and manipulate the text contents of a PDF, with
> PDFTextStripper doing the text extraction. I am also setting the options below:
> {code:java}
> PDDocument document = Loader.loadPDF(new File("src/test/resources/pdf/test.pdf"));
> PDFTextStripper textStripper = new PDFTextStripper();
> textStripper.setSortByPosition(true);
> textStripper.setWordSeparator(" ");
> {code}
> The text comparator is not transitive, as mentioned in a comment. The custom
> merge sort scrambles the text at the individual character level, and I can't
> repair the text afterwards.
> I have attached the sample PDF and its text output. The merge sort doesn't
> consider the y and x coordinates when sorting the letters. Taking them into
> account while sorting would fix this issue.
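
To illustrate the transitivity point raised in the report, here is a minimal sketch of how a tolerance-based position comparator can break transitivity, and how ordering by a derived line index avoids that. Everything in it is an assumption made for illustration: the Glyph class, the TOLERANCE value, and both comparators are hypothetical and are not PDFBox's actual comparator or sort.

```java
import java.util.Comparator;

class ComparatorTransitivityDemo {
    // Hypothetical stand-in for a positioned glyph; not a PDFBox class.
    static final class Glyph {
        final String text;
        final float x;
        final float y;

        Glyph(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Assumed tolerance: glyphs whose y values differ by less than this count as one line.
    static final float TOLERANCE = 2.0f;

    // Tolerance-based ordering: if y values are close, order by x; otherwise order by y.
    // "Close" is not an equivalence relation, so this ordering is not transitive.
    static final Comparator<Glyph> FUZZY = (a, b) -> {
        if (Math.abs(a.y - b.y) < TOLERANCE) {
            return Float.compare(a.x, b.x);
        }
        return Float.compare(a.y, b.y);
    };

    // Snap y to an integer line index so that "same line" becomes an exact key.
    static int lineOf(Glyph g) {
        return Math.round(g.y / TOLERANCE);
    }

    // Lexicographic ordering on (line index, x); transitive because it compares fixed keys.
    static final Comparator<Glyph> BUCKETED = (a, b) -> {
        int byLine = Integer.compare(lineOf(a), lineOf(b));
        return byLine != 0 ? byLine : Float.compare(a.x, b.x);
    };

    // True if p < q and q < r hold, but p < r does not.
    static boolean violatesTransitivity(Comparator<Glyph> c, Glyph p, Glyph q, Glyph r) {
        return c.compare(p, q) < 0 && c.compare(q, r) < 0 && c.compare(p, r) >= 0;
    }

    public static void main(String[] args) {
        Glyph a = new Glyph("a", 10f, 101.5f);
        Glyph b = new Glyph("b", 20f, 100.0f);
        Glyph c = new Glyph("c", 30f, 98.5f);
        System.out.println(violatesTransitivity(FUZZY, a, b, c));    // prints "true"
        System.out.println(violatesTransitivity(BUCKETED, a, b, c)); // prints "false"
    }
}
```

With the fuzzy comparator, a sorts before b and b before c (each pair is within tolerance, so x decides), yet a and c are 3.0 apart in y and fall back to the y comparison, which puts c first. The bucketed variant trades that inconsistency for a different risk: two glyphs that sit close together but on opposite sides of a bucket boundary end up on different lines, so it is a sketch of the idea rather than a drop-in fix.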