[ https://issues.apache.org/jira/browse/PDFBOX-684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved PDFBOX-684.
----------------------------------

         Assignee: Jukka Zitting
    Fix Version/s: 1.2.0
       Resolution: Fixed

Patch committed in revision 956624. Thanks!

> Incorrect ordering of compound Arabic glyphs
> --------------------------------------------
>
>                 Key: PDFBOX-684
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-684
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.0.0, 1.1.0
>            Reporter: Yigal Dayan
>            Assignee: Jukka Zitting
>            Priority: Minor
>             Fix For: 1.2.0
>
>         Attachments: PDFStreamEngine.patch, PDFTextStripper.patch, 
> zzz.after_fix.txt, zzz.before_fix.txt, zzz.pdf
>
>   Original Estimate: 3h
>  Remaining Estimate: 3h
>
> Some Arabic PDFs contain compound glyphs for stylistic reasons.
> Each such glyph encodes two letters: FI, SI, LI, LJ, LM, etc.
> Before a line is sent to the bidirectional algorithm, all characters have 
> been sorted into visual order except for these pairs, which are handled as 
> one unit and therefore keep their original (logical) order. The bidi 
> algorithm straightens out most characters, but reverses the glyph pairs.
> To fix this, the output of font.encode() should be examined and reversed on 
> the spot if it contains a pair of Arabic characters. This may require adding 
> a stub method to PDFStreamEngine (in method processEncodedText) that 
> PDFTextStripper can override (in sort mode only).
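
The reversal step proposed in the description could be sketched roughly as
follows. This is only an illustration, not the committed patch: the class
ArabicPairFix and the method reverseIfArabicCompound are hypothetical names,
and the sketch assumes the decoded text of a single glyph is available as a
Java String. It does not touch PDFStreamEngine or PDFTextStripper.

```java
// Hypothetical helper for illustration; not part of PDFBox.
final class ArabicPairFix {

    // True if the code point lies in one of the Arabic Unicode blocks
    // checked here (base Arabic and the two presentation-form blocks).
    static boolean isArabic(int codePoint) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(codePoint);
        return block == Character.UnicodeBlock.ARABIC
            || block == Character.UnicodeBlock.ARABIC_PRESENTATION_FORMS_A
            || block == Character.UnicodeBlock.ARABIC_PRESENTATION_FORMS_B;
    }

    // Reverses the decoded text of one glyph when it consists of two or
    // more Arabic code points; any other input is returned unchanged.
    static String reverseIfArabicCompound(String decoded) {
        if (decoded == null) {
            return null;
        }
        if (decoded.codePointCount(0, decoded.length()) < 2) {
            return decoded; // a single character needs no reordering
        }
        for (int i = 0; i < decoded.length(); ) {
            int cp = decoded.codePointAt(i);
            if (!isArabic(cp)) {
                return decoded; // mixed or non-Arabic text is left alone
            }
            i += Character.charCount(cp);
        }
        // StringBuilder.reverse() keeps surrogate pairs intact.
        return new StringBuilder(decoded).reverse().toString();
    }
}
```

In a text stripper, such a helper would presumably be called on each
glyph's decoded text in sort mode only, so that the later bidi pass restores
the pair to its logical order. Combining marks are not treated specially
here, which a real fix might need to handle.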

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
