[ https://issues.apache.org/jira/browse/PDFBOX-684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved PDFBOX-684.
----------------------------------

         Assignee: Jukka Zitting
    Fix Version/s: 1.2.0
       Resolution: Fixed

Patch committed in revision 956624. Thanks!

> Incorrect ordering of compound Arabic glyphs
> --------------------------------------------
>
>                 Key: PDFBOX-684
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-684
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.0.0, 1.1.0
>            Reporter: Yigal Dayan
>            Assignee: Jukka Zitting
>            Priority: Minor
>             Fix For: 1.2.0
>
>         Attachments: PDFStreamEngine.patch, PDFTextStripper.patch,
>                      zzz.after_fix.txt, zzz.before_fix.txt, zzz.pdf
>
>   Original Estimate: 3h
>  Remaining Estimate: 3h
>
> Some Arabic PDFs contain compound glyphs for stylistic reasons.
> Such glyphs encode two letters: FI, SI, LI, LJ, LM, etc.
> Before a line gets sent to the bidirectional algorithm, all characters have
> been sorted into a visual order, except for these pairs. This is because they
> are handled as one unit and maintain their original (logical) order. The bidi
> algorithm straightens out most characters, but reverses the glyph pairs.
> To fix this, the output of font.encode() should be examined and reversed on
> the spot if it contains pairs of Arabic characters. Possibly you need to add
> a stub method to PDFStreamEngine (in method processEncodedText) that
> PDFTextStripper can override (in sort mode only).
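
The check proposed in the description (examine the string decoded for one glyph and reverse it on the spot if it is a pair of Arabic characters) can be sketched roughly as follows. This is a standalone illustration, not the committed patch: the class ArabicGlyphPairs and the method reverseIfArabicCompound are hypothetical names, and testing Unicode blocks is only one assumed way to decide that a character is Arabic. It uses only the JDK and does not call any PDFBox API.

```java
// Hypothetical helper, not part of PDFBox: illustrates the reversal
// described in the issue using only standard JDK classes.
final class ArabicGlyphPairs {

    private ArabicGlyphPairs() {
    }

    /** Assumption: "Arabic" means the basic block or one of the presentation-form blocks. */
    static boolean isArabic(int codePoint) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(codePoint);
        return block == Character.UnicodeBlock.ARABIC
                || block == Character.UnicodeBlock.ARABIC_PRESENTATION_FORMS_A
                || block == Character.UnicodeBlock.ARABIC_PRESENTATION_FORMS_B;
    }

    /**
     * Reverses the text decoded for a single glyph when it holds two or more
     * characters and all of them are Arabic. Anything else is returned unchanged.
     */
    static String reverseIfArabicCompound(String decoded) {
        if (decoded == null || decoded.codePointCount(0, decoded.length()) < 2) {
            return decoded; // single letters need no correction
        }
        boolean allArabic = decoded.codePoints().allMatch(ArabicGlyphPairs::isArabic);
        if (!allArabic) {
            return decoded; // e.g. Latin ligatures are left alone
        }
        // Put the compound glyph into the same (visual) order as its neighbours,
        // so that the later bidi pass restores the logical order.
        return new StringBuilder(decoded).reverse().toString();
    }
}
```

In the arrangement suggested above, a stub hook in PDFStreamEngine would return the decoded string unchanged, and PDFTextStripper would override it with something like reverseIfArabicCompound only when sorting by position is enabled.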