[ 
https://issues.apache.org/jira/browse/PDFBOX-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-4036:
------------------------------------
    Attachment:     (was: PDFBOX-4036-reduced.pdf)

> Invalid ToUnicode CMap in font
> ------------------------------
>
>                 Key: PDFBOX-4036
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4036
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.4, 2.0.8
>         Environment: Windows 10 64 bit, STS 3.9.1, JDK 1.8.0_152, Gradle
>            Reporter: Oleksii Zinkovskyi
>         Attachments: CSTA17.pdf, PDFBOX-4036-reduced.pdf
>
>
> While calling textStripper.getText(document) on the attached PDF file to 
> extract text and save it to .txt, I receive the following warnings:
> {quote}Dec 15, 2017 8:53:22 AM org.apache.pdfbox.pdmodel.font.PDFont <init>
> WARNING: Invalid ToUnicode CMap in font UYQXWX+MaterialIcons-Regular
> Dec 15, 2017 8:53:22 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+380 (380) in font 
> UYQXWX+MaterialIcons-Regular
> Dec 15, 2017 8:53:22 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+381 (381) in font 
> UYQXWX+MaterialIcons-Regular
> Dec 15, 2017 8:53:25 AM org.apache.pdfbox.pdmodel.font.PDFont <init>
> WARNING: Invalid ToUnicode CMap in font FANHRS+MaterialIcons-Regular
> Dec 15, 2017 8:53:25 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+380 (380) in font 
> FANHRS+MaterialIcons-Regular
> Dec 15, 2017 8:53:25 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+381 (381) in font 
> FANHRS+MaterialIcons-Regular{quote}
> In the end the file is generated and properly saved, but some letters are 
> missing (like "ft" in "software" or "ff" in "different"). So far I've tested 
> close to 10 files and this is the only problematic one I've found. Depending 
> on which program I use to view the resulting .txt file, I get either blank 
> spaces (Notepad) or "NUL" values (Notepad++) in place of the missing letters. 
> What's more, some editors (Sublime Text) outright refuse to open the 
> file and treat it as unreadable/corrupted byte code. Suffice it to say, 
> working with such a file is somewhat difficult.
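
A minimal sketch for confirming the symptom described above, assuming the
"NUL" values shown by Notepad++ are U+0000 characters in the extracted string.
It only counts and locates those characters; it does not fix the ToUnicode
CMap. The names NulScan, countNul and firstNulIndex are hypothetical and not
part of PDFBox, and the sample string is a made-up stand-in for the result of
textStripper.getText(document).

```java
final class NulScan {

    /** Counts the U+0000 characters in the given text. */
    static int countNul(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '\0') {
                count++;
            }
        }
        return count;
    }

    /** Returns the index of the first U+0000 character, or -1 if there is none. */
    static int firstNulIndex(String text) {
        return text.indexOf('\0');
    }

    public static void main(String[] args) {
        // Assumption: stand-in for extracted text in which the "ft" of
        // "software" came out as a single NUL character.
        String sample = "so\0ware";
        System.out.println("NUL count: " + countNul(sample));
        System.out.println("first NUL at index: " + firstNulIndex(sample));
    }
}
```

Running such a check on the extracted text before writing the .txt file would
show whether the unmapped glyphs arrive as NUL characters, which could explain
why some editors treat the saved file as binary.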



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
