[
https://issues.apache.org/jira/browse/PDFBOX-377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649664#action_12649664
]
Jukka Zitting commented on PDFBOX-377:
--------------------------------------
Yeah, I guess using ICU4J is reasonable in this case. Could the code be
organized so that ICU4J is only loaded when such normalization is needed?
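
One way this could be arranged, as a minimal sketch rather than the actual patch: keep every reference to the normalization library inside a single nested class and check cheaply beforehand whether the text contains Arabic presentation forms at all. The JVM initializes a class on first use, so the nested class is never touched for plain text. The names LazyNormalization, Backend and normalizeIfNeeded are hypothetical, and java.text.Normalizer from the JDK stands in for ICU4J here so the example runs without extra jars.

```java
import java.text.Normalizer;

final class LazyNormalization {
    // Set by Backend's static initializer so the lazy behaviour can be observed.
    static volatile boolean backendInitialized = false;

    // Stand-in for the only class that would reference ICU4J types.
    // Assumption: in a real patch the ICU4J calls would live here.
    private static final class Backend {
        static { backendInitialized = true; }

        static String normalize(String text) {
            // NFKC replaces compatibility characters, including Arabic
            // presentation forms, with their general forms.
            return Normalizer.normalize(text, Normalizer.Form.NFKC);
        }
    }

    // True if the text contains a character from Arabic Presentation
    // Forms-A (U+FB50..U+FDFF) or Forms-B (U+FE70..U+FEFC).
    static boolean hasArabicPresentationForms(String text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if ((c >= 0xFB50 && c <= 0xFDFF) || (c >= 0xFE70 && c <= 0xFEFC)) {
                return true;
            }
        }
        return false;
    }

    // Only touches Backend (and so only triggers its loading) when needed.
    static String normalizeIfNeeded(String text) {
        if (!hasArabicPresentationForms(text)) {
            return text;
        }
        return Backend.normalize(text);
    }
}
```

With this layout, documents without such characters would never cause the library classes to be loaded, so the dependency could stay optional at runtime.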
> Incorrect direction of extracted Arabic Text
> --------------------------------------------
>
> Key: PDFBOX-377
> URL: https://issues.apache.org/jira/browse/PDFBOX-377
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 0.8.0-incubator
> Reporter: Brian Carrier
> Attachments: hello3.pdf, PDFTextStripper.diff
>
>
> Arabic text (and text in other right-to-left languages) is stored in PDF
> files in presentation (visual) order, which is the opposite of the logical
> order in which Arabic text is normally stored. In logical order the first
> byte is for the right-most character, but in the output of PDFBox the first
> byte is always for the left-most character.
> Further, PDF files typically store the presentation forms of Arabic
> characters instead of the more general forms, for example U+FB50 instead of
> U+0671. Presentation forms are not supposed to appear in logically ordered
> text, but PDFBox does not normalize them out.
> The attached patch solves both of these problems using the ICU4J library
> (http://www.icu-project.org/). It identifies the dominant text direction of
> each page and reverses the order of each line (only if any right to left text
> exists). It then normalizes the text to remove the presentation forms.
> An example file is attached. Without the patch, the following is
> (incorrectly) produced:
> Hello ﺪﻤﺤﻣ World.
> With the patch, the following is (correctly) produced:
> Hello محمد World.
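
The two steps described in the issue can be illustrated with a small sketch. This is not the attached patch and it does not use ICU4J: it relies only on JDK classes (Character.getDirectionality and java.text.Normalizer), and the names VisualToLogical, isRtl and fixLine are hypothetical. It reverses each maximal run of right-to-left characters in a line and then applies NFKC normalization to replace the presentation forms. It does not implement the full Unicode Bidirectional Algorithm, so spaces between Arabic words, digits and combining marks would need extra handling.

```java
import java.text.Normalizer;

final class VisualToLogical {
    // True for characters with a strong right-to-left direction
    // (Hebrew-style R or Arabic-style AL, which covers presentation forms).
    static boolean isRtl(char c) {
        byte d = Character.getDirectionality(c);
        return d == Character.DIRECTIONALITY_RIGHT_TO_LEFT
            || d == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC;
    }

    // Converts one line from visual order to an approximation of logical order.
    static String fixLine(String visual) {
        StringBuilder out = new StringBuilder(visual.length());
        int i = 0;
        while (i < visual.length()) {
            int start = i;
            boolean rtl = isRtl(visual.charAt(i));
            // Advance to the end of the current run of same-direction characters.
            while (i < visual.length() && isRtl(visual.charAt(i)) == rtl) {
                i++;
            }
            String run = visual.substring(start, i);
            // Right-to-left runs are stored left to right in the PDF, so reverse them.
            out.append(rtl ? new StringBuilder(run).reverse().toString() : run);
        }
        // NFKC maps presentation forms such as U+FEAA to general forms such as U+062F.
        return Normalizer.normalize(out.toString(), Normalizer.Form.NFKC);
    }
}
```

Calling fixLine("Hello \uFEAA\uFEE4\uFEA4\uFEE3 World.") on the visually ordered line from the example yields "Hello \u0645\u062D\u0645\u062F World.", which matches the corrected output quoted above.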