[ https://issues.apache.org/jira/browse/TIKA-722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated TIKA-722:
------------------------------------

    Attachment: 000279.pdf

> Arabic PDF doesn't extract correctly
> ------------------------------------
>
>                 Key: TIKA-722
>                 URL: https://issues.apache.org/jira/browse/TIKA-722
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>            Priority: Minor
>         Attachments: 000279.pdf
>
>
> I have a PDF w/ Arabic font that Tika fails to extract (gets all
> gibberish).
> Looks like the PDF does not include the separate Unicode text metadata
> (hmm: would Tika extract that if it were present?), and copy/paste out
> of the PDF also produces gibberish.
> To fix this I think we'd somehow have to know the mapping for the
> font (this particular font is AXTManal)?

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
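
A rough way to flag this kind of failure automatically, offered only as a sketch and not as anything Tika does: measure how much of the extracted text actually falls in the Arabic script. The class and method names (`ArabicRatio`, `arabicRatio`) are hypothetical, and the sample strings are made up for illustration; they are not taken from 000279.pdf. Only standard JDK calls are used.

```java
public class ArabicRatio {

    /**
     * Returns the fraction of letter code points in the text that belong
     * to the Arabic script, or 0.0 if the text contains no letters.
     */
    public static double arabicRatio(String text) {
        int letters = 0;
        int arabic = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) {
                continue; // skip digits, spaces, punctuation
            }
            letters++;
            if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.ARABIC) {
                arabic++;
            }
        }
        return letters == 0 ? 0.0 : (double) arabic / letters;
    }

    public static void main(String[] args) {
        // Illustrative input: a real Arabic word, written with Unicode escapes.
        System.out.println(arabicRatio("\u0645\u0631\u062D\u0628\u0627"));
        // Illustrative input: Latin-1 letters, standing in for garbled output.
        System.out.println(arabicRatio("\u00C7\u00E1\u00D3\u00E1\u00C7\u00E3"));
    }
}
```

A ratio near zero for a document known to be Arabic would suggest the font's character codes were passed through without a usable Unicode mapping. This only detects the problem; recovering the text would still need the actual mapping for the font.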