[ https://issues.apache.org/jira/browse/TIKA-722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13108031#comment-13108031 ]
Michael McCandless commented on TIKA-722:
-----------------------------------------

Thanks Uwe; it sounds like there's not much we can do for such old PDFs.

> Arabic PDF doesn't extract correctly
> ------------------------------------
>
>                 Key: TIKA-722
>                 URL: https://issues.apache.org/jira/browse/TIKA-722
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>            Priority: Minor
>         Attachments: 000279.pdf, JUFO96.PDF, metadata.png
>
>
> I have a PDF w/ Arabic font that Tika fails to extract (gets all
> gibberish).
> Looks like the PDF does not include the separate Unicode text metadata
> (hmm: would Tika extract that if it were present?), and copy/paste out
> of the PDF also produces gibberish.
> To fix this I think we'd somehow have to know the mapping for the
> font (this particular font is AXTManal)?