[jira] [Updated] (PDFBOX-5828) PDFTextStripper created garbled text

arjunce (Jira) Fri, 24 May 2024 02:56:28 -0700


     [ 
https://issues.apache.org/jira/browse/PDFBOX-5828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


arjunce updated PDFBOX-5828:
----------------------------
    Labels:   (was: easyfix)

> PDFTextStripper created garbled text
> ------------------------------------
>
>                 Key: PDFBOX-5828
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5828
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 3.0.1 PDFBox, 3.0.2 PDFBox
>            Reporter: arjunce
>            Priority: Major
>         Attachments: output_text_stripper.txt, screenshot-1.png, test.pdf
>
>
> Hello folks,
> I am using PDFBox to extract and manipulate the text content of a PDF, with
> PDFTextStripper doing the extraction. I set the following options:
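> {code:java}
> import java.io.File;
> import org.apache.pdfbox.Loader;
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.text.PDFTextStripper;
>
> PDDocument document = Loader.loadPDF(new File("src/test/resources/pdf/test.pdf"));
> PDFTextStripper textStripper = new PDFTextStripper();
> textStripper.setSortByPosition(true);  // order text by position on the page
> textStripper.setWordSeparator(" ");
> {code}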
> The TextPositionComparator is not transitive, as mentioned in a comment. The
> custom merge sort garbles the text at the individual-character level, and I
> cannot repair the text afterwards.
> I have attached the sample PDF and its text output. The merge sort does not
> consider both the y and x coordinates when ordering the letters; taking them
> into account while sorting would fix this issue.
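
The ordering the reporter asks for (y first, then x) can be sketched with a comparator that is transitive by construction. This is only an illustration under stated assumptions, not PDFBox's implementation or a proposed patch: `Glyph`, `LINE_HEIGHT`, `row` and `render` are hypothetical names, y is assumed to grow downward, and a fixed line height is assumed so that each glyph's row depends on that glyph alone.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class LineOrderSketch {

    /** Hypothetical stand-in for a positioned character; not a PDFBox class. */
    static final class Glyph {
        final String text;
        final double x;
        final double y;

        Glyph(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Assumed fixed line height in page units (illustrative value only). */
    static final double LINE_HEIGHT = 12.0;

    /**
     * Row index computed from one glyph's y alone. Because the key never
     * depends on the other glyph in a comparison, ordering by (row, x)
     * is transitive, unlike a pairwise "y within tolerance" test.
     */
    static long row(Glyph g) {
        return (long) Math.floor(g.y / LINE_HEIGHT);
    }

    /** Orders glyphs top to bottom by row, then left to right by x. */
    static final Comparator<Glyph> BY_ROW_THEN_X =
            Comparator.<Glyph>comparingLong(LineOrderSketch::row)
                      .thenComparingDouble(g -> g.x);

    /** Sorts a copy of the glyphs and joins them, one output line per row. */
    static String render(List<Glyph> glyphs) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(BY_ROW_THEN_X);
        StringBuilder sb = new StringBuilder();
        long currentRow = 0;
        for (Glyph g : sorted) {
            long r = row(g);
            if (sb.length() > 0 && r != currentRow) {
                sb.append('\n'); // start a new line when the row changes
            }
            currentRow = r;
            sb.append(g.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph("b", 20, 101)); // same row as "a", slightly lower baseline
        glyphs.add(new Glyph("c", 10, 115)); // next row
        glyphs.add(new Glyph("a", 10, 100));
        System.out.println(render(glyphs));
    }
}
```

Bucketing by a fixed line height is a simplification: a line whose glyphs straddle a bucket boundary would be split, so a real fix would need a more careful grouping of glyphs into lines before sorting within each line.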



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
