[ 
https://issues.apache.org/jira/browse/PDFBOX-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16292709#comment-16292709
 ] 

Tilman Hausherr commented on PDFBOX-4036:
-----------------------------------------

So Notepad++ is the better product. I use it a lot to edit binary files (e.g. 
PDFs).

The PDF has a bad ToUnicode stream:
{code}
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo <<
  /Registry (Adobe)
  /Ordering (UCS)
  /Supplement 0
>> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000><FFFF>
endcodespacerange
21 beginbfrange
<0001><0001><0020>
<0002><0002><0041>
<0004><000a><0043>
<000c><000d><004b>
<000f><000f><004e>
<0011><0011><0050>
<0013><0016><0052>
<001c><0024><0061>
<0026><0035><006b>
<0077><0077><0000>
<0078><0078><0000>
<007b><007b><0026>
<007c><0080><0030>
<0082><0083><0036>
<0086><0086><002e>
<0087><0087><002c>
<0088><0088><003a>
<008f><008f><0027>
<009b><009b><002d>
<00a2><00a3><0028>
<00a8><00a8><002f>
endbfrange
endcmap
CMapName currentdict /CMap defineresource pop
end
end
{code}
Look for 0077 and 0078: these are the (hex) codes used in that PDF for "ff" 
and "ft". Both are mapped to the Unicode value 0 (U+0000) instead of the 
character values for "ff" and "ft". You can inspect the PDF with PDFDebugger; 
the font is "G3", and the two codes are at positions 119 and 120 (decimal).
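
Such entries can also be found with a small script rather than by eye. This is only a minimal sketch: BfRangeZeroFinder and codesMappedToNul are hypothetical names, not part of PDFBox. It assumes the ToUnicode stream has already been decoded to text, and it only understands the simple <start><end><destination> form of a bfrange line with 4-digit hex codes (the array form is ignored).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of PDFBox.
final class BfRangeZeroFinder {
    // One simple bfrange line: <start><end><destination>, 4 hex digits each.
    private static final Pattern ENTRY = Pattern.compile(
            "<([0-9A-Fa-f]{4})>\\s*<([0-9A-Fa-f]{4})>\\s*<([0-9A-Fa-f]{4})>");

    /** Returns the source codes (as decimal ints) that are mapped to U+0000. */
    static List<Integer> codesMappedToNul(String cmapText) {
        List<Integer> codes = new ArrayList<>();
        // Work line by line so the two-element codespacerange line cannot
        // be combined with the following line into a false match.
        for (String line : cmapText.split("\\R")) {
            Matcher m = ENTRY.matcher(line.trim());
            if (!m.matches()) {
                continue;
            }
            int start = Integer.parseInt(m.group(1), 16);
            int destination = Integer.parseInt(m.group(3), 16);
            // Within a range, code c maps to destination + (c - start),
            // so only the first code of a range can map to 0.
            if (destination == 0) {
                codes.add(start);
            }
        }
        return codes;
    }
}
```

Fed the stream above, this should report 119 and 120, the decimal forms of 0077 and 0078.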

Adobe Reader isn't better - it delivers different (bad) values. If you want the 
"0" values to be replaced, you'll have to do it yourself in postprocessing.

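A rough sketch of what such postprocessing could look like is below. NulPostProcessor and its methods are hypothetical names, not a PDFBox API. Because the CMap maps both "ff" and "ft" to the same value, the extracted text alone cannot tell which ligature was meant, so this sketch only swaps the NUL characters for a visible placeholder (or removes them) so that editors can open the file.

```java
// Hypothetical helper for illustration; not part of PDFBox.
final class NulPostProcessor {
    private static final String NUL = "\u0000";

    /** Replaces every U+0000 in the extracted text with the given replacement. */
    static String replaceNul(String extracted, String replacement) {
        // String.replace performs a literal (non-regex) replacement.
        return extracted.replace(NUL, replacement);
    }

    /** Counts U+0000 characters, e.g. to log how many glyphs were lost. */
    static int countNul(String extracted) {
        int count = 0;
        for (int i = 0; i < extracted.length(); i++) {
            if (extracted.charAt(i) == '\u0000') {
                count++;
            }
        }
        return count;
    }
}
```

The input would be the string returned by textStripper.getText(document), as in the report below. Passing "\uFFFD" (the Unicode replacement character) as the replacement keeps the gaps visible; passing "" simply drops them.
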
> Invalid ToUnicode CMap in font
> ------------------------------
>
>                 Key: PDFBOX-4036
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4036
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.4, 2.0.8
>         Environment: Windows 10 64 bit, STS 3.9.1, JDK 1.8.0_152, Gradle
>            Reporter: Oleksii Zinkovskyi
>         Attachments: CSTA17.pdf
>
>
> While calling textStripper.getText(document) on the attached PDF file to 
> extract text and save it to .txt, I receive following warnings:
> {quote}Dec 15, 2017 8:53:22 AM org.apache.pdfbox.pdmodel.font.PDFont <init>
> WARNING: Invalid ToUnicode CMap in font UYQXWX+MaterialIcons-Regular
> Dec 15, 2017 8:53:22 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+380 (380) in font 
> UYQXWX+MaterialIcons-Regular
> Dec 15, 2017 8:53:22 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+381 (381) in font 
> UYQXWX+MaterialIcons-Regular
> Dec 15, 2017 8:53:25 AM org.apache.pdfbox.pdmodel.font.PDFont <init>
> WARNING: Invalid ToUnicode CMap in font FANHRS+MaterialIcons-Regular
> Dec 15, 2017 8:53:25 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+380 (380) in font 
> FANHRS+MaterialIcons-Regular
> Dec 15, 2017 8:53:25 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+381 (381) in font 
> FANHRS+MaterialIcons-Regular{quote}
> In the end the file is generated and properly saved, but some letters are 
> missing (like "ft" in "software" or "ff" in "different"). So far I've tested 
> close to 10 files and this is the only problematic item I've found. Depending 
> on what program I use to view the resulting .txt file, I either get blank 
> spaces (Notepad) or "NUL" values (Notepad++) in place of the missing letters. 
> What's more, some editors (Sublime Text Editor) outright refuse to open the 
> file and view it as unreadable/corrupted byte code. Suffice to say working 
> with such a file is somewhat difficult...


