Incorrect ordering of compound Arabic glyphs
--------------------------------------------
Key: PDFBOX-684
URL: https://issues.apache.org/jira/browse/PDFBOX-684
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 1.1.0, 1.0.0
Reporter: Yigal Dayan
Priority: Minor
Some Arabic PDFs contain compound glyphs for stylistic reasons.
Such glyphs encode two letters: FI, SI, LI, LJ, LM, etc.
Before a line is passed to the bidirectional algorithm, all characters have
been sorted into visual order, except for these pairs. Because each pair is
handled as one unit, its two letters keep their original (logical) order. The
bidi algorithm then corrects the order of most characters, but reverses the
letters within each glyph pair.
To fix this, the output of font.encode() should be examined and reversed on the
spot if it contains a pair of Arabic characters. This may require adding a stub
method to PDFStreamEngine, called from processEncodedText, that PDFTextStripper
can override (in sort mode only).
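
A minimal sketch of the "reverse on the spot" check, as an illustration only.
ArabicPairFixer and fixOrder are hypothetical names and are not part of
PDFBox. The sketch assumes the decoded text of one glyph is available as a
String and that a compound glyph decodes to two or more code points in the
Arabic script. It does not show where the call would be wired in.

```java
// Hypothetical helper, NOT part of PDFBox: illustrates the proposed
// "reverse on the spot" step for the decoded text of a single glyph.
final class ArabicPairFixer {

    private ArabicPairFixer() {
    }

    /** Returns true if s is non-empty and every code point is in the Arabic script. */
    static boolean isAllArabic(String s) {
        if (s.isEmpty()) {
            return false;
        }
        return s.codePoints().allMatch(
                cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.ARABIC);
    }

    /**
     * Reverses the decoded text of one glyph if it holds two or more Arabic
     * code points (assumed to be a compound glyph). Anything else is returned
     * unchanged.
     */
    static String fixOrder(String decoded) {
        if (decoded == null) {
            return null;
        }
        int count = decoded.codePointCount(0, decoded.length());
        if (count < 2 || !isAllArabic(decoded)) {
            return decoded;
        }
        // StringBuilder.reverse() keeps surrogate pairs intact.
        return new StringBuilder(decoded).reverse().toString();
    }
}
```

A subclass hook in sort mode could pass each decoded glyph string through
fixOrder before the line is sorted. Single letters, Latin text and mixed
strings pass through untouched.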