[ 
https://issues.apache.org/jira/browse/TIKA-722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119403#comment-13119403
 ] 

Robert Muir commented on TIKA-722:
----------------------------------

In this case the original TTF font (AxtManal) is itself buggy.
The font uses glyph codes that map 1-1 to their Unicode characters, but the
glyph names are WRONG.

So Arabic glyphs in this font have misleading names like 'circumflex', which
really confused whatever produced this PDF when it embedded the font. You can
see this if you open the original TTF in fontforge; it gives tons of warnings
like:

'The glyph named circumflex is mapped to U+F0F6 But its name indicates it 
should be mapped to U+02C6'

It's not possible to open the embedded font in the PDF; it claims the font is
corrupted :)
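
To make the kind of check behind that fontforge warning concrete, here is a
minimal sketch in Python. It is not fontforge's or Tika's actual code: the
names AGL_SUBSET, find_name_mismatches, format_warning and sample_cmap are
made up for illustration, the glyph-name table is a tiny hand-picked subset of
the Adobe Glyph List, and the sample cmap holds only the one mapping quoted
above plus an assumed well-behaved 'A' entry.

```python
# Tiny hand-picked subset of the Adobe Glyph List (glyph name -> code point).
# A real check would load the full list; this subset is for illustration only.
AGL_SUBSET = {
    "circumflex": 0x02C6,
    "A": 0x0041,
    "space": 0x0020,
}


def find_name_mismatches(cmap, agl=AGL_SUBSET):
    """Return (name, mapped, implied) for glyphs whose name disagrees with the cmap.

    cmap maps a Unicode code point to the glyph name the font assigns to it.
    Glyph names that are not in the name table are skipped, not reported.
    """
    mismatches = []
    for codepoint, glyph_name in sorted(cmap.items()):
        implied = agl.get(glyph_name)
        if implied is not None and implied != codepoint:
            mismatches.append((glyph_name, codepoint, implied))
    return mismatches


def format_warning(name, mapped, implied):
    """Format one mismatch in the style of the warning quoted above."""
    return (
        f"The glyph named {name} is mapped to U+{mapped:04X} "
        f"but its name indicates it should be mapped to U+{implied:04X}"
    )


# Hypothetical cmap: 'A' is consistent, 'circumflex' is the mismatch quoted above.
sample_cmap = {0x0041: "A", 0xF0F6: "circumflex"}

for mismatch in find_name_mismatches(sample_cmap):
    print(format_warning(*mismatch))
```

With a real font, the cmap dictionary would have to come from a font library
rather than being typed in by hand.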

                
> Arabic PDF doesn't extract correctly
> ------------------------------------
>
>                 Key: TIKA-722
>                 URL: https://issues.apache.org/jira/browse/TIKA-722
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>            Priority: Minor
>         Attachments: 000279.pdf, JUFO96.PDF, metadata.png
>
>
> I have a PDF w/ Arabic font that Tika fails to extract (gets all
> gibberish).
> Looks like the PDF does not include the separate Unicode text metadata
> (hmm: would Tika extract that if it were present?), and copy/paste out
> of the PDF also produces gibberish.
> To fix this I think we'd somehow have to know the mapping for the
> font (this particular font is AXTManal)?


        
