[ https://issues.apache.org/jira/browse/PDFBOX-684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jukka Zitting resolved PDFBOX-684.
----------------------------------
Assignee: Jukka Zitting
Fix Version/s: 1.2.0
Resolution: Fixed
Patch committed in revision 956624. Thanks!
> Incorrect ordering of compound Arabic glyphs
> --------------------------------------------
>
> Key: PDFBOX-684
> URL: https://issues.apache.org/jira/browse/PDFBOX-684
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.0.0, 1.1.0
> Reporter: Yigal Dayan
> Assignee: Jukka Zitting
> Priority: Minor
> Fix For: 1.2.0
>
> Attachments: PDFStreamEngine.patch, PDFTextStripper.patch,
> zzz.after_fix.txt, zzz.before_fix.txt, zzz.pdf
>
> Original Estimate: 3h
> Remaining Estimate: 3h
>
> Some Arabic PDFs use compound glyphs for stylistic reasons. Each such glyph
> encodes two letters: FI, SI, LI, LJ, LM, etc.
> By the time a line is sent to the bidirectional algorithm, all characters
> have been sorted into visual order except these pairs, which are handled as
> one unit and so keep their original (logical) order. The bidi algorithm then
> restores the order of most characters but reverses the glyph pairs.
> To fix this, the output of font.encode() should be examined and reversed on
> the spot if it contains pairs of Arabic characters. This may require adding
> a stub method to PDFStreamEngine (in method processEncodedText) that
> PDFTextStripper can override (in sort mode only).
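
The "reverse on the spot" step described above can be sketched roughly as follows. This is not the committed patch (revision 956624), only a minimal illustration under stated assumptions: the class name ArabicPairFix and both method names are hypothetical, the input is assumed to be the string already decoded for a single glyph, and "Arabic" is approximated by Unicode block membership.

```java
import java.lang.Character.UnicodeBlock;

// Hypothetical helper, not part of PDFBox: reverses the decoded text of a
// compound glyph when every code point in it belongs to an Arabic block.
final class ArabicPairFix {
    private ArabicPairFix() {}

    // Coarse test (an assumption of this sketch): block membership only,
    // so Arabic-block digits and punctuation also count as "Arabic".
    static boolean isArabic(int codePoint) {
        UnicodeBlock block = UnicodeBlock.of(codePoint);
        return block == UnicodeBlock.ARABIC
            || block == UnicodeBlock.ARABIC_PRESENTATION_FORMS_A
            || block == UnicodeBlock.ARABIC_PRESENTATION_FORMS_B;
    }

    // Returns the input unchanged unless it holds two or more code points
    // that are all Arabic; in that case returns the reversed string.
    static String reverseIfArabicCompound(String decoded) {
        if (decoded == null || decoded.codePointCount(0, decoded.length()) < 2) {
            return decoded;
        }
        boolean allArabic = decoded.codePoints().allMatch(ArabicPairFix::isArabic);
        // StringBuilder.reverse() keeps surrogate pairs intact.
        return allArabic ? new StringBuilder(decoded).reverse().toString() : decoded;
    }
}
```

In the approach proposed in the description, a call like ArabicPairFix.reverseIfArabicCompound(text) would sit in the overridable stub, so that only the sorting text stripper applies it and single-letter or non-Arabic glyphs pass through untouched.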