[ https://issues.apache.org/jira/browse/PDFBOX-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118998#comment-13118998 ]
Andreas Lehmkühler commented on PDFBOX-941:
-------------------------------------------
I'm quite sure that this is not related to PDFBox, as text extraction works
fine here (without using Tika). Could it be a misconfigured environment, for
example missing resource files?
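
One way to test the missing-resource theory is to ask the class loader whether
the CMap file named in the error message is visible at all. The following is
only a rough sketch: the class name CMapResourceCheck and its helper methods
are made up for illustration, and the resource directory
org/apache/pdfbox/resources/cmap/ is an assumption about where PDFBox 1.x
bundles its predefined CMaps, so adjust it to match the jar actually in use.

```java
import java.net.URL;

public class CMapResourceCheck {
    // Assumption: PDFBox 1.x ships predefined CMaps under this classpath directory.
    static final String ASSUMED_CMAP_ROOT = "org/apache/pdfbox/resources/cmap/";

    // Builds the classpath location for a CMap name such as "Adobe-Japan1-UCS2".
    static String cmapPath(String cmapName) {
        return ASSUMED_CMAP_ROOT + cmapName;
    }

    // Returns true if the given resource can be located by the current class loader.
    static boolean isOnClasspath(String resourcePath) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            // Fall back to the system class loader if no context loader is set.
            loader = ClassLoader.getSystemClassLoader();
        }
        URL url = loader.getResource(resourcePath);
        return url != null;
    }

    public static void main(String[] args) {
        String path = cmapPath("Adobe-Japan1-UCS2");
        System.out.println(path + (isOnClasspath(path) ? " found" : " NOT found on classpath"));
    }
}
```

Run it with the same classpath as the failing application. A "NOT found"
result would point to a packaging or environment problem rather than to the
text extraction code itself.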
> extracting Japanese characters gives garbage
> --------------------------------------------
>
> Key: PDFBOX-941
> URL: https://issues.apache.org/jira/browse/PDFBOX-941
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.4.0
> Environment: java 1.6 on CentOS 64bit Linux and MacOSX 10.6
> Reporter: Liang Qu
> Assignee: Andreas Lehmkühler
> Fix For: 1.5.0
>
> Attachments: 1010gaiyou.pdf
>
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> When extracting text from this PDF file, I got the following error, and the
> extracted text was gibberish:
> 44 [main] ERROR org.apache.pdfbox.pdmodel.font.PDFont - Error: Could not parse predefined CMAP file for 'Adobe-Japan1-UCS2'
> PDFBox 1.2.1 worked fine with the same file, so I wonder why 1.4.0 does not.