[jira] [Commented] (PDFBOX-941) extracting Japanese characters gives garbage

Kevin Clark (Commented) (JIRA) Sat, 01 Oct 2011 10:30:56 -0700

    [ https://issues.apache.org/jira/browse/PDFBOX-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118855#comment-13118855 ]


Kevin Clark commented on PDFBOX-941:
------------------------------------

I'm seeing this with the Tika 0.10 release, which uses PDFBox 1.6.0:

2011-10-01 16:15:43,516 (53344917) [Parser-thread-1] ERROR org.apache.pdfbox.pdmodel.font.PDFont - Error: Could not parse predefined CMAP file for 'Adobe-Japan1-UCS2'

                
> extracting Japanese characters gives garbage
> --------------------------------------------
>
>                 Key: PDFBOX-941
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-941
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.4.0
>         Environment: java 1.6 on CentOS 64bit Linux and MacOSX 10.6
>            Reporter: Liang Qu
>            Assignee: Andreas Lehmkühler
>             Fix For: 1.5.0
>
>         Attachments: 1010gaiyou.pdf
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> When extracting text from this PDF file, I got the following error, and the
> extracted text was gibberish:
> 44 [main] ERROR org.apache.pdfbox.pdmodel.font.PDFont  - Error: Could not parse predefined CMAP file for 'Adobe-Japan1-UCS2'
> PDFBox 1.2.1 worked fine with the same file; I wonder why 1.4.0 does not.

