[jira] [Commented] (PDFBOX-4758) Text Extractor does not handle common typographic ligatures

Tilman Hausherr (Jira) Thu, 30 Jan 2020 20:30:19 -0800


    [ 
https://issues.apache.org/jira/browse/PDFBOX-4758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17027183#comment-17027183
 ]


Tilman Hausherr commented on PDFBOX-4758:
-----------------------------------------

The LibreOffice text extraction is perfect once you enable sorting. The MS
Word text extraction is terrible, but identical to what Adobe produces. The bad
parts are due to an incorrect /ToUnicode stream (at least for the first error;
I didn't look into the others).
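
For context, and not part of the original comment: when a PDF's /ToUnicode
mapping is correct, a ligature glyph may legitimately come out of extraction
as a single presentation-form code point such as U+FB01 (the "fi" ligature).
A minimal sketch of expanding such code points after extraction, using only
the JDK's java.text.Normalizer, might look like the following. The class and
method names are hypothetical, nothing here is PDFBox API, and it cannot
repair text that comes from an incorrect /ToUnicode stream.

```java
import java.text.Normalizer;

// Hypothetical helper for illustration only; not part of PDFBox.
final class LigatureExpander {

    // NFKC compatibility normalization replaces presentation forms such as
    // U+FB01 (fi) and U+FB03 (ffi) with their plain-letter equivalents.
    static String expand(String extracted) {
        return Normalizer.normalize(extracted, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Assumed sample input: text containing the fi and ffi ligatures.
        String extracted = "\uFB01nal o\uFB03ce";
        System.out.println(expand(extracted)); // prints "final office"
    }
}
```

Note that NFKC also folds other compatibility characters (full-width forms,
superscript digits, and so on), so it may change more than ligatures.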

> Text Extractor does not handle common typographic ligatures
> -----------------------------------------------------------
>
>                 Key: PDFBOX-4758
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4758
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.1, 2.0.18
>            Reporter: Michael Reynolds
>            Priority: Major
>         Attachments: TestExtractText.java, libreoffice-ligatures-test.pdf, 
> msword-ligatures-test.pdf
>
>
> TextExtractor mishandles typographic ligatures. I've attached test documents 
> from both Microsoft Word and LibreOffice.
> I've checked PDFBox's output against xPDF on CentOS, and the ligatures are 
> properly handled with that utility, so it appears that this is a PDFBox 
> defect.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

