[ https://issues.apache.org/jira/browse/TIKA-713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13140134#comment-13140134 ]
Ahmad Ajiloo commented on TIKA-713:
-----------------------------------

I'm testing the new Encoding.java file with other Persian PDF files. There is a new file, Simple2.pdf, that PDFBox cannot parse. Please find it attached. Thanks.

> Tika can not parse all of the persian pdf files
> -----------------------------------------------
>
>                 Key: TIKA-713
>                 URL: https://issues.apache.org/jira/browse/TIKA-713
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Ahmad Ajiloo
>         Attachments: Simple2.pdf, ebrat.pdf
>
>
> Hello
> I used Tika (in Nutch, of course) to parse some Persian PDF files. Some of the
> files were converted to plain text correctly, but for others the output was
> corrupted. I used the ICU4J v4 library and the text changed to right-to-left
> mode, but that did not resolve the problem: Tika cannot recognize
> any character of the input Persian PDF file!
> {quote}
> I copied this text from my PDF file via Document Viewer in Linux; it is
> clearly Persian text:
> --------------------------
> هر روز پس از نماز صبح، سوره مباركه الرحمن را تا "فباي آلاء ربكما تكذبان"
> بخواند.
> ) اين يعني 21 آيه اول سوره ، كه در قرآن رسم الخط "عثمانطه" تقريبا يك نصف
> صفحه است. (
> همچنين در روايات از حضرت رسول )ص( و ائمه اطهار )ع( آمده كه چند چيز براي قوت
> حافظه مفيد است:
> 1- مسواك كردن 2- روزه گرفتن 3- قرائت قرآن؛ مخصوصا آيه الكرسي
> 4- خوردن عسل 5- خوردن عدس 6- خوردن گوشت نزديک گردن
> --------------------------
> Tika returns this output:
> --------------------------
> 92 @A 8 * B > C9D !D ) (?) =/ > > > > (<) , 8 ; > 8 # > + 9!: > L > #) 4 M() * 0>
> * -3 IA J > - 2 (+ G > H -1
> (+ J 5#+C 0T J (+ O - 6 R . (+ O - 5 PH. (+ O -4
> --------------------------
> {quote}
> Thanks a lot

--
This message is automatically generated by JIRA.