[jira] [Comment Edited] (PDFBOX-2532) Text extraction fails due to the usage of the internal font mapping

John Hewson (JIRA) Mon, 08 Dec 2014 12:50:34 -0800

    [ https://issues.apache.org/jira/browse/PDFBOX-2532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14238438#comment-14238438 ]


John Hewson edited comment on PDFBOX-2532 at 12/8/14 8:50 PM:
--------------------------------------------------------------

It's very common to need to extract the Encoding from Type1C fonts, so Acrobat 
must be doing something other than just ignoring the encoding. Either it's a 
bug in Acrobat (which happens to produce good behaviour for this file) or they 
have some sort of heuristic.

The CharSet entry can't be the deciding factor, because it is optional, and its 
entries are unordered, so it provides no help in identifying a "jumbled" 
encoding. Two different encodings which contain the same characters will have 
the same CharSet, even if their order is different.
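
To illustrate the point about CharSet, here is a minimal sketch in plain Java. It does not use any PDFBox API; the class name, the helper charSetOf and the two sample encodings are hypothetical and exist only for this example. It assumes an encoding can be modelled as a map from character code to glyph name, and a CharSet as the bare set of names with no order attached.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class CharSetDemo {
    // Hypothetical helper: reduce an encoding (code -> glyph name) to the
    // unordered set of glyph names, which is all a CharSet-style list conveys.
    static Set<String> charSetOf(Map<Integer, String> encoding) {
        return new TreeSet<>(encoding.values());
    }

    // Illustrative encoding where each code maps to the expected glyph name.
    static Map<Integer, String> straight() {
        Map<Integer, String> m = new LinkedHashMap<>();
        m.put(65, "A");
        m.put(66, "B");
        m.put(67, "C");
        return m;
    }

    // Illustrative "jumbled" encoding: same glyph names, assigned to different codes.
    static Map<Integer, String> jumbled() {
        Map<Integer, String> m = new LinkedHashMap<>();
        m.put(65, "C");
        m.put(66, "A");
        m.put(67, "B");
        return m;
    }

    public static void main(String[] args) {
        // The two name sets are identical, so this prints true.
        System.out.println(charSetOf(straight()).equals(charSetOf(jumbled())));
        // The code-to-name mappings differ, so this prints false.
        System.out.println(straight().equals(jumbled()));
    }
}
```

Both encodings collapse to the same set of names, so nothing in that set can reveal which of the two is the jumbled one.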


> Text extraction fails due to the usage of the internal font mapping
> -------------------------------------------------------------------
>
>                 Key: PDFBOX-2532
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2532
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Andreas Lehmkühler
>             Fix For: 2.0.0
>
>         Attachments: PDFBOX2247-701542.pdf, PDFBOX2247-701542_cp_acrobat.txt, 
> PDFBOX2247-701542_sa_acrobat.txt, PDFBOX2247-701542_sa_acrobat_osx.txt, 
> PDFBOX2247-701542_sa_reader_osx.txt, PDFBOX2247-Debugger.png
>
>
> If a pdf doesn't provide any mapping (neither an encoding nor a toUnicode 
> mapping) we have to decide where to get a suitable mapping ourselves. We 
> can't use the internal font mapping of the type1C font as it doesn't work in 
> every case, see PDFBOX-2377 which provides a solution for the 1.8-branch.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
