[ https://issues.apache.org/jira/browse/PDFBOX-4758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17027183#comment-17027183 ]
Tilman Hausherr commented on PDFBOX-4758:
-----------------------------------------
The LibreOffice text extraction is perfect; you just need to enable sorting.
The MS Word text extraction is terrible, but it is identical to the result
from Adobe. The bad parts are due to an incorrect /ToUnicode stream (at least
for the first error; I didn't look at the others).
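
For readers who want plain letters where the extracted text contains
correctly mapped ligature code points (for example U+FB01 for "fi"), one
possible post-processing step is Unicode compatibility normalization. The
following is a minimal sketch under that assumption, using only the JDK; the
class and method names are hypothetical and not part of PDFBox. It cannot
repair text that was mapped through an incorrect /ToUnicode stream, since the
wrong characters are already in the extracted string.

```java
import java.text.Normalizer;

public class LigatureDemo {

    // Hypothetical helper (not a PDFBox API): applies NFKC normalization,
    // which replaces compatibility characters such as the Latin ligatures
    // U+FB00..U+FB04 with their plain-letter equivalents.
    public static String expandLigatures(String extracted) {
        return Normalizer.normalize(extracted, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Sample input containing the "fi", "ffi" and "fl" ligature code points.
        String extracted = "\uFB01nal o\uFB03ce \uFB02ow";
        System.out.println(expandLigatures(extracted));
    }
}
```

Note that NFKC also rewrites other compatibility characters (full-width
forms, superscript digits, and so on), so it may change more than ligatures.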
> Text Extractor does not handle common typographic ligatures
> -----------------------------------------------------------
>
> Key: PDFBOX-4758
> URL: https://issues.apache.org/jira/browse/PDFBOX-4758
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.1, 2.0.18
> Reporter: Michael Reynolds
> Priority: Major
> Attachments: TestExtractText.java, libreoffice-ligatures-test.pdf,
> msword-ligatures-test.pdf
>
>
> TextExtractor mishandles typographic ligatures. I've attached test documents
> from both Microsoft Word and LibreOffice.
> I've checked PDFBox's output against xPDF on CentOS, and the ligatures are
> handled properly by that utility, so it appears that this is a PDFBox
> defect.