[
https://issues.apache.org/jira/browse/PDFBOX-940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13126212#comment-13126212
]
Antoni Mylka commented on PDFBOX-940:
-------------------------------------
I ran into the same problem with a confidential file. In the process I
think I found a related issue: PDFBOX-1137.
I'm not a PDF expert, but that file contains the following PDF objects:
24 0 obj
<</Type/Font/Subtype/Type0/BaseFont/TT491A9C96tCID/Encoding 18 0 R/DescendantFonts[22 0 R]>>
endobj
22 0 obj
<</Subtype/CIDFontType2/CIDSystemInfo 23 0 R/BaseFont/XJXBKC+TT491A9C96tCID/Type/Font/Name/R22/FontDescriptor 21 0 R/DW 1000
/W[691[259]
724[677
626
626]
737[677]]/CIDToGIDMap/Identity
>>
endobj
18 0 obj
<</Type/CMap/Name/R18/WMode 0/CMapName/WinCharSetFFFF-H/CIDSystemInfo<<
/Registry(Adobe)
/Ordering(WinCharSetFFFF)
/Supplement 0
>>
/Filter/FlateDecode/Length 19 0 R>>stream
endstream
endobj
So there is an embedded CMap for WinCharSetFFFF-H, a parent font that refers
to the embedded CMap as its encoding, and a child font with no encoding.
Applying the PDFBOX-1137 patch allowed the CMap to be parsed.
Then, in the PDType0Font constructor, just after the descendant font is
constructed, I added an if that makes the descendant "inherit" the cmap from
the parent font. This fixed the NPEs during text extraction, which happened
because the cmap was missing:
    descendentFont = PDFontFactory.createFont( descendantFontDictionary );
    if (descendentFont.cmap == null) {
        descendentFont.cmap = this.cmap;
    }
I don't even know if this makes sense. Is the descendant font supposed to
"inherit" the encoding from the parent font? This "fixed" the visible errors,
but the output I get is still garbled. The file is supposed to contain text in
Traditional Chinese. Can anyone with more PDF knowledge take a look at this?
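To make the fallback above easier to discuss, here is a minimal, self-contained
sketch of the same idea. It does not use the real PDFBox classes:
SketchCMap, SketchFont and SketchType0Font are hypothetical stand-ins invented
for illustration, and the single mapping (691 to "A") is arbitrary sample data,
not taken from the file. It only shows the null-check-and-inherit step and the
kind of NPE it avoids.

```java
import java.util.HashMap;
import java.util.Map;

public class CMapInheritanceSketch {

    /** Hypothetical stand-in for a parsed CMap: character code -> mapped value. */
    static class SketchCMap {
        private final Map<Integer, String> mappings = new HashMap<>();

        void add(int code, String value) {
            mappings.put(code, value);
        }

        String lookup(int code) {
            return mappings.get(code);
        }
    }

    /** Hypothetical stand-in for a font that may or may not carry its own cmap. */
    static class SketchFont {
        SketchCMap cmap; // null when the font dictionary has no encoding

        String decode(int code) {
            // Throws NullPointerException when cmap is null, which is the
            // kind of failure the fallback is meant to avoid.
            return cmap.lookup(code);
        }
    }

    /** Hypothetical stand-in for the parent (Type0) font. */
    static class SketchType0Font extends SketchFont {
        final SketchFont descendant;

        SketchType0Font(SketchCMap encoding, SketchFont descendant) {
            this.cmap = encoding;
            this.descendant = descendant;
            // The fallback: a descendant without a cmap borrows the parent's.
            if (descendant.cmap == null) {
                descendant.cmap = this.cmap;
            }
        }
    }

    public static void main(String[] args) {
        SketchCMap embedded = new SketchCMap();
        embedded.add(691, "A"); // arbitrary sample mapping

        SketchFont child = new SketchFont(); // no encoding of its own
        SketchType0Font parent = new SketchType0Font(embedded, child);

        System.out.println(child.decode(691));                  // prints A
        System.out.println(parent.descendant.cmap == embedded); // prints true
    }
}
```

Note that in the PDF font model the Encoding CMap of a Type 0 font maps
character codes to CIDs, not to Unicode, so a fallback like this can avoid the
NPE without yielding readable text. That may be related to the garbled output,
but it is only a guess.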
> [pdmodel.font.PDFont] Error: Could not parse predefined CMAP file for
> 'PDFXC-Indentity0-0'
> -------------------------------------------------------------------------------------------
>
> Key: PDFBOX-940
> URL: https://issues.apache.org/jira/browse/PDFBOX-940
> Project: PDFBox
> Issue Type: Bug
> Affects Versions: 1.4.0
> Environment: Tomcat 6.0.18, windows server 2003, pdfbox-1.4.0.jar
> Reporter: krishna
> Attachments: gen_preview1.png, oob_pdf.pdf, pdf fonts.JPG, pdf
> fonts1.JPG, pdf fonts2.JPG, pdf properties1.JPG, pdf properties2.JPG, pdf
> properties3.JPG
>
> Original Estimate: 48h
> Remaining Estimate: 48h
>
> Hi,
> when i am trying to upload a pdf document the following error is thrown in
> the tomcat.. i am using pdfbox-1.4.0.jar..
> 17:29:33,465 ERROR [pdmodel.font.PDFont] Error: Could not parse predefined
> CMAP file for 'PDFXC-Indentity0-0'
> please find the solution