[ https://issues.apache.org/jira/browse/TIKA-722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Uwe Schindler updated TIKA-722:
-------------------------------

    Attachment: JUFO96.PDF

Here is a non-Persian example (actually a very, very early writeup of my own from 1996, taken from my personal archive; don't read it). If you try to copy and paste text out of it, you will see the same problem. It was also produced by Acrobat Distiller 3.0 with font subsets.

> Arabic PDF doesn't extract correctly
> ------------------------------------
>
>                 Key: TIKA-722
>                 URL: https://issues.apache.org/jira/browse/TIKA-722
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>            Priority: Minor
>         Attachments: 000279.pdf, JUFO96.PDF, metadata.png
>
>
> I have a PDF w/ Arabic font that Tika fails to extract (gets all
> gibberish).
> Looks like the PDF does not include the separate Unicode text metadata
> (hmm: would Tika extract that if it were present?), and copy/paste out
> of the PDF also produces gibberish.
> To fix this I think we'd somehow have to know the mapping for the
> font (this particular font is AXTManal)?

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira