[ https://issues.apache.org/jira/browse/PDFBOX-2532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14233455#comment-14233455 ]
John Hewson edited comment on PDFBOX-2532 at 12/3/14 8:04 PM:
--------------------------------------------------------------

The Type1C fonts are using built-in encodings, which are corrupted. Any thoughts on how Acrobat is able to extract the text? I notice that Acrobat's preflight lists the fonts' encodings as "built-in".

was (Author: jahewson):
The Type1C fonts are using built-in encodings, which are corrupted. Any thoughts on how Acrobat is able to extract the text? I notice that Acrobat's preflight lists the font's encodings as "built-in".

> Text extraction fails due to the usage of the internal font mapping
> -------------------------------------------------------------------
>
>                 Key: PDFBOX-2532
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2532
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Andreas Lehmkühler
>             Fix For: 2.0.0
>
>         Attachments: PDFBOX2247-701542.pdf, PDFBOX2247-701542_cp_acrobat.txt,
> PDFBOX2247-701542_sa_acrobat.txt, PDFBOX2247-701542_sa_acrobat_osx.txt,
> PDFBOX2247-701542_sa_reader_osx.txt, PDFBOX2247-Debugger.png
>
>
> If a pdf doesn't provide any mapping (neither an encoding nor a toUnicode
> mapping) we have to decide where to get a suitable mapping ourselves. We
> can't use the internal font mapping of the type1C font as it doesn't work in
> every case, see PDFBOX-2377 which provides a solution for the 1.8-branch.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)