Tika cannot parse all Persian PDF files
---------------------------------------

                 Key: TIKA-713
                 URL: https://issues.apache.org/jira/browse/TIKA-713
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.9
            Reporter: Ahmad Ajiloo
             Fix For: 0.9


Hello,
I used Tika (through Nutch) to parse some Persian PDF files. Some of the 
files were converted to plain text correctly, but for others the output was 
corrupted. I also used the ICU4J v4 library, which switched the text to 
right-to-left mode, but that did not resolve the problem: Tika fails to 
recognize any character of the affected Persian PDF files.

{quote}
I copied this text from my PDF file using Document Viewer on Linux; it is 
clearly Persian text:
--------------------------
‫هر روز پس از نماز صبح، سوره مباركه الرحمن را تا "فباي آلاء ربكما تكذبان" 
بخواند.‬
‫) اين يعني 21 آيه اول سوره ، كه در قرآن رسم الخط "عثمانطه" تقريبا يك نصف صفحه 
است. (‬
‫همچنين در روايات از حضرت رسول )ص( و ائمه اطهار )ع( آمده كه چند چيز براي قوت 
حافظه مفيد است:‬
‫1- مسواك كردن 2- روزه گرفتن 3- قرائت قرآن؛ مخصوصا آيه الكرسي‬
‫4- خوردن عسل‬ ‫5- خوردن عدس 6- خوردن گوشت نزديک گردن
--------------------------
Tika returns this output:
--------------------------
 92   @A   8 * B
   C9D  !D       ) (?)   =/
   >
 
 (<) ,    8 ;  
 8 #

   +  9!: 
     L
  #)    4   M() * 0>
 * -3    IA J  
  - 2   (+   G
 H  -1
 (+ J 5#+C     0T J (+  O - 6    R . (+  O - 5     PH. (+  O -4
--------------------------
{quote}
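
The difference between the two samples above can be checked mechanically. 
The following is only a rough sketch, assuming that a correct extraction of a 
Persian document consists mostly of Arabic-script letters; the class name 
ArabicScriptCheck and the method arabicLetterRatio are hypothetical helpers 
and are not part of Tika or Nutch. It uses only the standard Java library.

```java
public class ArabicScriptCheck {

    /**
     * Returns the fraction of letters in the text that belong to the
     * Arabic script (which Persian uses). Returns 0.0 if there are no letters.
     */
    public static double arabicLetterRatio(String text) {
        int letters = 0;
        int arabic = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) {
                continue; // skip digits, punctuation, whitespace
            }
            letters++;
            if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.ARABIC) {
                arabic++;
            }
        }
        return letters == 0 ? 0.0 : (double) arabic / letters;
    }

    public static void main(String[] args) {
        // A phrase from the Persian sample, written with Unicode escapes
        // so the source file does not depend on its encoding.
        String expected = "\u062E\u0648\u0631\u062F\u0646 \u0639\u0633\u0644";
        // A fragment of the corrupted output reported above.
        String actual = "C9D  !D       ) (?)   =/";

        System.out.println("expected text ratio: " + arabicLetterRatio(expected));
        System.out.println("actual output ratio: " + arabicLetterRatio(actual));
    }
}
```

A ratio near zero for a document known to be Persian suggests the extraction 
produced wrong characters rather than merely the wrong text direction.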
Thanks a lot.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

