Incorrect direction of extracted Arabic Text
--------------------------------------------

                 Key: PDFBOX-377
                 URL: https://issues.apache.org/jira/browse/PDFBOX-377
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 0.8.0-incubator
            Reporter: Brian Carrier


Arabic text (and text in other right-to-left languages) is stored in PDF files 
in presentation (visual) order, which is the opposite of the logical order in 
which Arabic text is normally stored. In logical order the first character is 
the right-most one on the line, but the output of PDFBox always has the 
left-most character first. 

Further, PDF files typically store the presentation forms of Arabic characters 
instead of the more general forms, for example U+FB50 instead of U+0671. 
Presentation forms are not supposed to appear in logically ordered text, but 
PDFBox does not normalize them out. 

The attached patch solves both of these problems using the ICU4J library 
(http://www.icu-project.org/).  It identifies the dominant text direction of 
each page and reverses the order of each line (only if any right to left text 
exists).  It then normalizes the text to remove the presentation forms. 
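
The two steps can be sketched roughly as follows. This is not the attached 
patch: it uses only the JDK (java.text.Normalizer and 
Character.getDirectionality) instead of ICU4J, and the class and method names 
are hypothetical. It also makes a simplifying assumption: only unbroken runs of 
right-to-left characters are reversed, so spaces, digits and combining marks 
inside Arabic text are not handled the way a full bidi algorithm would handle 
them. NFKC is used for the normalization step, which also folds compatibility 
characters other than Arabic presentation forms.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of PDFBox or of the attached patch.
class VisualToLogical {

    // True for characters with strong right-to-left directionality
    // (Hebrew-style R or Arabic-style AL).
    static boolean isRtl(char c) {
        byte d = Character.getDirectionality(c);
        return d == Character.DIRECTIONALITY_RIGHT_TO_LEFT
            || d == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC;
    }

    // Converts one extracted line from visual order to logical order.
    static String toLogical(String visualLine) {
        StringBuilder out = new StringBuilder(visualLine.length());
        int i = 0;
        while (i < visualLine.length()) {
            if (isRtl(visualLine.charAt(i))) {
                // Step 1: reverse each unbroken run of right-to-left characters.
                int start = i;
                while (i < visualLine.length() && isRtl(visualLine.charAt(i))) {
                    i++;
                }
                out.append(new StringBuilder(visualLine.substring(start, i)).reverse());
            } else {
                // Left-to-right and neutral characters stay where they are.
                out.append(visualLine.charAt(i++));
            }
        }
        // Step 2: NFKC maps presentation forms (e.g. U+FB50) to their
        // general forms (e.g. U+0671).
        return Normalizer.normalize(out, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // "Hello", then the presentation forms in visual order, then "World."
        String visual = "Hello \uFEAA\uFEE4\uFEA4\uFEE3 World.";
        System.out.println(toLogical(visual));
    }
}
```

Reversing before normalizing matters for ligatures such as lam-alef: the 
ligature moves as one character and is then decomposed into its letters in 
logical order.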

An example file is attached.  Without the patch, the following is (incorrectly) 
produced:
Hello ﺪﻤﺤﻣ World. 

With the patch, the following is (correctly) produced:
Hello محمد World. 


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.