Tika cannot parse some Persian PDF files
-----------------------------------------
Key: TIKA-713
URL: https://issues.apache.org/jira/browse/TIKA-713
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 0.9
Reporter: Ahmad Ajiloo
Fix For: 0.9
Hello,
I used Tika (through Nutch) to parse some Persian PDF files. Some of the
files were converted to plain text correctly, but for others the output was
corrupted. I added the ICU4J v4 library, and the text changed to
right-to-left mode, but the problem was not resolved: for these files Tika
does not recognize a single character of the input Persian PDF!
{quote}
I copied this text from my PDF file using Document Viewer on Linux. It is
clearly Persian text:
--------------------------
هر روز پس از نماز صبح، سوره مباركه الرحمن را تا "فباي آلاء ربكما تكذبان"
بخواند.
) اين يعني 21 آيه اول سوره ، كه در قرآن رسم الخط "عثمانطه" تقريبا يك نصف صفحه
است. (
همچنين در روايات از حضرت رسول )ص( و ائمه اطهار )ع( آمده كه چند چيز براي قوت
حافظه مفيد است:
1- مسواك كردن 2- روزه گرفتن 3- قرائت قرآن؛ مخصوصا آيه الكرسي
4- خوردن عسل 5- خوردن عدس 6- خوردن گوشت نزديک گردن
--------------------------
Tika returns this output:
--------------------------
92 @A 8 * B
C9D !D ) (?) =/
>
(<) , 8 ;
8 #
+ 9!:
L
#) 4 M() * 0>
* -3 IA J
- 2 (+ G
H -1
(+ J 5#+C 0T J (+ O - 6 R . (+ O - 5 PH. (+ O -4
--------------------------
{quote}
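To tell the good files from the corrupted ones without reading every output, something like the following might help. It is only a minimal sketch, not part of Tika or Nutch: the class name `PersianTextCheck` and the method `arabicLetterRatio` are hypothetical, and the idea of using the share of Arabic-script letters as a corruption signal is an assumption, not a confirmed diagnosis. It uses only the Java standard library (Java 7 or later).

```java
import java.lang.Character.UnicodeScript;

// Hypothetical helper, not part of Tika: estimates how much of an
// extracted text is written in the Arabic script (which Persian uses).
public class PersianTextCheck {

    /** Returns the fraction of letters in the text that belong to the Arabic script. */
    public static double arabicLetterRatio(String text) {
        int letters = 0;
        int arabic = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) {
                continue; // skip digits, punctuation and whitespace
            }
            letters++;
            if (UnicodeScript.of(cp) == UnicodeScript.ARABIC) {
                arabic++;
            }
        }
        // A text without any letters is reported as 0.0.
        return letters == 0 ? 0.0 : (double) arabic / letters;
    }

    public static void main(String[] args) {
        // First line of the text copied from the document viewer.
        String expected = "هر روز پس از نماز صبح";
        // First line of the output returned for the same file.
        String extracted = "92 @A 8 * B";
        System.out.println("expected:  " + arabicLetterRatio(expected));
        System.out.println("extracted: " + arabicLetterRatio(extracted));
    }
}
```

A ratio near zero for a document that is known to be Persian would suggest that the extraction lost the character mapping, as in the output quoted above.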
Thanks a lot.