[
https://issues.apache.org/jira/browse/PDFBOX-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12689132#action_12689132
]
Andreas Lehmkühler commented on PDFBOX-420:
-------------------------------------------
As far as I understand the encoding handling, this issue comes up every time
TrueType CID fonts are used. Whenever this kind of font is used, "Identity-H"
is used as the encoding. The patch maps this encoding to the character set "JIS",
which stands for ISO-2022-JP, a Japanese encoding (see
org.apache.pdfbox.encoding.conversion.CJKEncodings.java).
So in the end I don't know where the right solution lies. Is it wrong to simply map
"Identity-H" to "JIS", or is the real cause of this problem the missing support for
CID fonts?
Any suggestions or hints for solving this issue?
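
To make the question concrete, here is a minimal sketch of the difference
between the two readings of an "Identity-H" string. It is not PDFBox code: the
class IdentityHSketch, its methods, the sample bytes and the CID-to-Unicode
table are all hypothetical and only stand in for what a font's ToUnicode CMap
would provide. The one fact it relies on is that Identity-H reads each pair of
bytes as a single big-endian CID, so the bytes are not text in any Japanese
character set.

```java
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in PDFBox.
final class IdentityHSketch {

    // Identity-H: every two bytes form one CID (big-endian, code == CID).
    // A trailing odd byte, if any, is ignored in this sketch.
    static int[] toCids(byte[] data) {
        int[] cids = new int[data.length / 2];
        for (int i = 0; i < cids.length; i++) {
            cids[i] = ((data[2 * i] & 0xFF) << 8) | (data[2 * i + 1] & 0xFF);
        }
        return cids;
    }

    // Map CIDs to text through a lookup table, standing in for a ToUnicode CMap.
    // CIDs without an entry become U+FFFD (the replacement character).
    static String decode(int[] cids, Map<Integer, String> toUnicode) {
        StringBuilder sb = new StringBuilder();
        for (int cid : cids) {
            String s = toUnicode.get(cid);
            sb.append(s != null ? s : "\uFFFD");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed sample: two CIDs as they might appear in a content stream.
        byte[] shown = {0x00, 0x2A, 0x01, 0x5F};

        // Assumed table: in a real file these values depend on the embedded font.
        Map<Integer, String> toUnicode = new HashMap<>();
        toUnicode.put(0x002A, "\u65E5"); // U+65E5
        toUnicode.put(0x015F, "\u672C"); // U+672C

        System.out.println("via CID table: " + decode(toCids(shown), toUnicode));

        // Treating the same bytes as ISO-2022-JP text interprets them one by one
        // as characters, which has nothing to do with the CIDs above.
        if (Charset.isSupported("ISO-2022-JP")) {
            String asJis = new String(shown, Charset.forName("ISO-2022-JP"));
            System.out.println("as ISO-2022-JP: " + asJis.length() + " chars");
        }
    }
}
```

If this reading is right, mapping "Identity-H" to "JIS" can only work by
accident, and the text has to come from per-font information such as a
ToUnicode CMap, which points towards the missing CID font support.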
> Japanese Characters are garbled.
> --------------------------------
>
> Key: PDFBOX-420
> URL: https://issues.apache.org/jira/browse/PDFBOX-420
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 0.8.0-incubator
> Reporter: Takashi Komatsubara
> Priority: Critical
> Attachments: supportJapanese-fontbox.patch, supportJapanese.patch,
> TestFilesForJapaneseGarbledIssue.zip
>
>
> The extracted Japanese characters are completely garbled.
> This issue is very critical for Japanese users.