[ https://issues.apache.org/jira/browse/PDFBOX-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tilman Hausherr updated PDFBOX-4036:
------------------------------------
Attachment: PDFBOX-4036-reduced.pdf
> Invalid ToUnicode CMap in font
> ------------------------------
>
> Key: PDFBOX-4036
> URL: https://issues.apache.org/jira/browse/PDFBOX-4036
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.4, 2.0.8
> Environment: Windows 10 64 bit, STS 3.9.1, JDK 1.8.0_152, Gradle
> Reporter: Oleksii Zinkovskyi
> Attachments: CSTA17.pdf, PDFBOX-4036-reduced.pdf,
> PDFBOX-4036-reduced.pdf
>
>
> While calling textStripper.getText(document) on the attached PDF file to
> extract text and save it to .txt, I receive the following warnings:
> {quote}Dec 15, 2017 8:53:22 AM org.apache.pdfbox.pdmodel.font.PDFont <init>
> WARNING: Invalid ToUnicode CMap in font UYQXWX+MaterialIcons-Regular
> Dec 15, 2017 8:53:22 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+380 (380) in font
> UYQXWX+MaterialIcons-Regular
> Dec 15, 2017 8:53:22 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+381 (381) in font
> UYQXWX+MaterialIcons-Regular
> Dec 15, 2017 8:53:25 AM org.apache.pdfbox.pdmodel.font.PDFont <init>
> WARNING: Invalid ToUnicode CMap in font FANHRS+MaterialIcons-Regular
> Dec 15, 2017 8:53:25 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+380 (380) in font
> FANHRS+MaterialIcons-Regular
> Dec 15, 2017 8:53:25 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+381 (381) in font
> FANHRS+MaterialIcons-Regular{quote}
> In the end the file is generated and saved, but some letters are
> missing (like "ft" in "software" or "ff" in "different"). So far I've tested
> close to 10 files and this is the only problematic one I've found. Depending
> on which program I use to view the resulting .txt file, I get either blank
> spaces (Notepad) or "NUL" values (Notepad++) in place of the missing letters.
> What's more, some editors (Sublime Text Editor) outright refuse to open the
> file and treat it as unreadable/corrupted byte code. Suffice it to say, working
> with such a file is somewhat difficult...
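
The NUL symptom described above can be checked on the extracted string before it is written to disk. The following is only a minimal sketch using the Java standard library, not part of PDFBox and not a fix for the ToUnicode CMap itself. The class name NulScan and both method names are hypothetical, and it assumes (based on the Notepad++ observation in the report) that each unmapped glyph ends up as a U+0000 character in the text returned by getText.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a PDFBox class: inspects text that has already
// been extracted and looks for U+0000 characters left by unmapped glyphs.
public class NulScan {

    /** Returns the index of every U+0000 character in the given text. */
    public static List<Integer> findNulOffsets(String text) {
        List<Integer> offsets = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '\0') {
                offsets.add(i);
            }
        }
        return offsets;
    }

    /**
     * Replaces each U+0000 with U+FFFD (the Unicode replacement character)
     * so the gap stays visible and text editors no longer see NUL bytes.
     */
    public static String markUnmapped(String text) {
        return text.replace('\0', '\uFFFD');
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by text extraction (assumption).
        String extracted = "so\0ware is di\0erent";
        System.out.println("NUL offsets: " + findNulOffsets(extracted));
        System.out.println("Marked text: " + markUnmapped(extracted));
    }
}
```

Marking the gaps rather than deleting them keeps the positions of the missing ligatures visible, which may help when comparing the output against the original PDF.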
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)