Tika cannot parse all Persian PDF files

ahmad ajiloo Sun, 11 Sep 2011 22:58:29 -0700

Hello,
I used Tika (within Nutch, of course) to parse some Persian PDF files. Some
of the files were converted to plain text correctly, but for others the
output was corrupted. I tried the ICU4J v4 library, which changed the text
to right-to-left mode, but that did not resolve the problem: for those
files Tika does not recognize a single character of the input Persian PDF!


I copied the following text from one of the PDFs using Document Viewer in
Linux. It is clearly Persian text:
--------------------------
‫هر روز پس از نماز صبح، سوره مباركه الرحمن را تا "فباي آلاء ربكما تكذبان"
بخواند.‬
‫) اين يعني 21 آيه اول سوره ، كه در قرآن رسم الخط "عثمانطه" تقريبا يك نصف
صفحه است. (‬
‫همچنين در روايات از حضرت رسول )ص( و ائمه اطهار )ع( آمده كه چند چيز براي قوت
حافظه مفيد است:‬
‫1- مسواك كردن 2- روزه گرفتن 3- قرائت قرآن؛ مخصوصا آيه الكرسي‬
‫4- خوردن عسل‬ ‫5- خوردن عدس 6- خوردن گوشت نزديک گردن
--------------------------
Tika returns this output for the same passage:
--------------------------
 92   @A   8 * B
   C9D  !D       ) (?)   =/
   >

 (<) ,    8 ;
 8 #

   +  9!:
     L
  #)    4   M() * 0>
 * -3    IA J
  - 2   (+   G
 H  -1
 (+ J 5#+C     0T J (+  O - 6    R . (+  O - 5     PH. (+  O -4
--------------------------
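
A quick way to tell which files are affected is to measure how much of the
extracted text falls in the Arabic-script Unicode blocks. The following is
only a minimal Python sketch of that check; the function name
arabic_script_ratio is hypothetical and not part of Tika or Nutch, and the
block ranges listed are an assumption about which code points matter for
Persian text.

```python
def arabic_script_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters in Arabic-script blocks.

    Hypothetical helper for diagnosing extraction output; not a Tika API.
    """
    # Arabic, Arabic Supplement, Arabic Presentation Forms-A and Forms-B.
    ranges = (
        (0x0600, 0x06FF),
        (0x0750, 0x077F),
        (0xFB50, 0xFDFF),
        (0xFE70, 0xFEFF),
    )
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(
        1 for c in chars if any(lo <= ord(c) <= hi for lo, hi in ranges)
    )
    return hits / len(chars)


if __name__ == "__main__":
    expected = "خوردن عسل"          # fragment of the text copied from the viewer
    extracted = " 92   @A   8 * B"  # first line of the corrupted output
    print(arabic_script_ratio(expected))
    print(arabic_script_ratio(extracted))
```

A ratio near zero for a file that is known to be Persian would flag it as
one of the corrupted cases.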
Can anyone help me?
Thanks a lot.
