[ https://issues.apache.org/jira/browse/TIKA-713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13103394#comment-13103394 ]
Robert Muir commented on TIKA-713:
----------------------------------

Thanks Ahmad...

I took a look at this PDF and I suspect this is the problem:

The fonts contained in the document have custom font encodings. I opened them up in FontForge and, e.g., Arabic alef maps to U+0006. So that's why you see the garbage; it's actually unrelated to ICU or the bidirectional algorithm.

I think the reason copy/paste works fine in this document is that it probably has Unicode PDF metadata... maybe PDFBox doesn't support this?

Disclaimer: I didn't look at any PDFBox code yet or really try to debug it.

> Tika can not parse all of the persian pdf files
> -----------------------------------------------
>
>                 Key: TIKA-713
>                 URL: https://issues.apache.org/jira/browse/TIKA-713
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Ahmad Ajiloo
>             Fix For: 0.9
>
>         Attachments: ebrat.pdf
>
>
> Hello
> I used Tika (via Nutch) to parse some Persian PDF files. Some of the
> files were converted to plain text cleanly, but for others the output was
> corrupted. I used the ICU4J v4 library and the text changed to right-to-left
> mode, but that did not resolve the problem: Tika cannot
> understand any character of the input Persian PDF file!
> {quote}
> I copied this text from my PDF file via Document Viewer in Linux; it is
> clearly Persian text:
> --------------------------
> هر روز پس از نماز صبح، سوره مباركه الرحمن را تا "فباي آلاء ربكما تكذبان"
> بخواند.
> ) اين يعني 21 آيه اول سوره ، كه در قرآن رسم الخط "عثمانطه" تقريبا يك نصف
> صفحه است. (
> همچنين در روايات از حضرت رسول )ص( و ائمه اطهار )ع( آمده كه چند چيز براي قوت
> حافظه مفيد است:
> 1- مسواك كردن 2- روزه گرفتن 3- قرائت قرآن؛ مخصوصا آيه الكرسي
> 4- خوردن عسل 5- خوردن عدس 6- خوردن گوشت نزديک گردن
> --------------------------
> Tika returns this output:
> --------------------------
> 92 @A 8 * B
> C9D !D ) (?) =/
>
>
>
> (<) , 8 ;
> 8 #
> + 9!:
> L
> #) 4 M() * 0>
> * -3 IA J
> - 2 (+ G
> H -1
> (+ J 5#+C 0T J (+ O - 6 R . (+ O - 5 PH. (+ O -4
> --------------------------
> {quote}
> thanks a lot
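
One rough way to check the diagnosis in the comment (glyphs mapped to code points such as U+0006 rather than to Arabic letters) is to look at which Unicode blocks the extracted text actually lands in. The Java sketch below does only that: `ExtractionCheck` and its methods are hypothetical names, not part of Tika or PDFBox, and it relies on nothing but the JDK. It assumes the caller already has the extracted string and expects it to be Persian; it does not inspect the PDF fonts themselves.

```java
import java.lang.Character.UnicodeBlock;

/** Hypothetical diagnostic helper; not part of Tika or PDFBox. */
final class ExtractionCheck {

    /** True if the code point lies in one of the Arabic blocks used for Persian text. */
    static boolean isArabic(int cp) {
        UnicodeBlock block = UnicodeBlock.of(cp);
        return block == UnicodeBlock.ARABIC
            || block == UnicodeBlock.ARABIC_PRESENTATION_FORMS_A
            || block == UnicodeBlock.ARABIC_PRESENTATION_FORMS_B;
    }

    /** Fraction of non-whitespace code points that are Arabic; 0.0 for blank input. */
    static double arabicRatio(String text) {
        long visible = text.codePoints().filter(cp -> !Character.isWhitespace(cp)).count();
        if (visible == 0) {
            return 0.0;
        }
        long arabic = text.codePoints().filter(ExtractionCheck::isArabic).count();
        return (double) arabic / visible;
    }

    /** Counts C0 control characters other than tab, line feed and carriage return. */
    static long countControls(String text) {
        return text.codePoints()
            .filter(cp -> cp < 0x20 && cp != '\t' && cp != '\n' && cp != '\r')
            .count();
    }

    public static void main(String[] args) {
        String expected = "\u0633\u0648\u0631\u0647"; // a Persian word from the sample text
        String garbled = "92 @A 8 * B\u0006";         // start of the reported output plus U+0006
        System.out.println("expected: ratio=" + arabicRatio(expected)
            + " controls=" + countControls(expected));
        System.out.println("garbled:  ratio=" + arabicRatio(garbled)
            + " controls=" + countControls(garbled));
    }
}
```

A ratio near zero on a document known to be Persian, especially together with stray control characters, would be consistent with a font-encoding problem rather than a bidi or ICU issue, though it does not prove it.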