Tika cannot parse all Persian PDF files
---------------------------------------

Key: TIKA-713
URL: https://issues.apache.org/jira/browse/TIKA-713
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 0.9
Reporter: Ahmad Ajiloo
Fix For: 0.9

Hello,

I used Tika (through Nutch) to parse some Persian PDF files. Some of the files were converted to plain text correctly, but for others the output was corrupted. I used the ICU4J v4 library, and the text direction changed to right-to-left, but the problem was not resolved: Tika cannot recognize any character of the input Persian PDF file.

{quote}
I copied this text from my PDF file via Document Viewer in Linux. It is clearly Persian text:
--------------------------
هر روز پس از نماز صبح، سوره مباركه الرحمن را تا "فباي آلاء ربكما تكذبان" بخواند. ) اين يعني 21 آيه اول سوره ، كه در قرآن رسم الخط "عثمانطه" تقريبا يك نصف صفحه است. ( همچنين در روايات از حضرت رسول )ص( و ائمه اطهار )ع( آمده كه چند چيز براي قوت حافظه مفيد است: 1- مسواك كردن 2- روزه گرفتن 3- قرائت قرآن؛ مخصوصا آيه الكرسي 4- خوردن عسل 5- خوردن عدس 6- خوردن گوشت نزديک گردن
--------------------------
Tika returns this output:
--------------------------
92 @A 8 * B C9D !D ) (?) =/ > (<) , 8 ; 8 # + 9!: L #) 4 M() * 0> * -3 IA J - 2 (+ G H -1 (+ J 5#+C 0T J (+ O - 6 R . (+ O - 5 PH. (+ O -4
--------------------------
{quote}

Thanks a lot.
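
To flag affected files in bulk, one option is to measure how much of the extracted text falls inside Arabic-script Unicode blocks. The sketch below is only an illustration and is not part of Tika or Nutch; the function names and the 0.5 threshold are hypothetical and would need tuning on real documents.

```python
# Unicode blocks that contain Arabic-script letters (used for Persian text).
ARABIC_RANGES = (
    (0x0600, 0x06FF),  # Arabic
    (0x0750, 0x077F),  # Arabic Supplement
    (0xFB50, 0xFDFF),  # Arabic Presentation Forms-A
    (0xFE70, 0xFEFF),  # Arabic Presentation Forms-B
)


def arabic_script_ratio(text):
    """Return the share of non-whitespace characters in Arabic-script blocks."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(
        1 for c in chars
        if any(lo <= ord(c) <= hi for lo, hi in ARABIC_RANGES)
    )
    return hits / len(chars)


def looks_corrupted(text, threshold=0.5):
    """Heuristic (assumed threshold): too few Arabic-script characters."""
    return arabic_script_ratio(text) < threshold


# Short excerpts from the two samples quoted in this report.
expected = "هر روز پس از نماز صبح"
garbled = "92 @A 8 * B C9D !D"

print(looks_corrupted(expected))
print(looks_corrupted(garbled))
```

A file whose extracted text has almost no Arabic-script characters, like the second sample, is a likely candidate for this bug. A low ratio can also simply mean the document is not in Persian, so the check is only a first filter.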