Re: Help needed to resolve issue with converting Arabic characters to presentation forms

Hesham G. Wed, 15 Feb 2012 20:41:34 -0800

Hamed,

Nice effort. Thanks for sharing this information.
I hope you will be able to overcome this and share your solution.



Best regards,
Hesham 


---------------------------------------------
Included message :

> Hi,
> 
> I'm trying to resolve PDFBOX-1216, which I reported a while ago, by
> debugging the PDFBox source code, and I need some advice on what to
> do. In brief, the issue is that PDFBox doesn't use presentation forms
> when rendering Arabic / Persian text from a PDF to an image, so the
> characters are shown disconnected. I'm not sure yet, but I guess this
> is called "ligature"?
> 
> Anyway, here's what I concluded so far, and if anyone could guide me,
> I may be able to fix this and provide a patch.
> 
> * In the PDF file, different codes are used for the different
> presentation forms of a single Unicode character (in the content
> stream of the PDF file, under the "TJ" command, which is "show text,
> allowing individual glyph positioning")
> 
> * In the "ToUnicode" table of the PDF file (which is read into the
> "cmap" variable of the PDFont class), all the presentation forms are
> mapped to the same Unicode character (which is not in the
> presentation range)
> 
> * When PDFBox is drawing text on the graphics canvas, it puts the
> Unicode value in a string and calls the "PDSimpleFont.drawString"
> method.
> 
> * Since the single character is isolated, it is either not found in
> the Font, or the isolated form (if present) is rendered.
> 
> Example:
> 
> You can check characters in the following address:
> http://en.wikipedia.org/wiki/Arabic_characters_in_Unicode
> 
> When there is a U+0647 character in the file ( ه ) that should be
> connected to the character before it, it should appear as U+FEEA ( ﻪ
> ).
> In the attached PDF file, this character appears in two different
> fonts. The internal PDF codes for this character in those fonts are
> "00C4" and "03EA".
> 
> When I set a breakpoint in the "PDSimpleFont.drawString" method and
> manually replace the string content with the appropriate presentation
> form (like "\ufeea" for the above character), everything else works
> fine and the output image is correct (the presentation form is found
> in the font, whereas the original character, "\u0647", is not
> embedded in it).
> 
> PDF viewers have some way of figuring out the presentation forms,
> because the PDF is displayed correctly in all viewers.
> 
> But I could not find out how I can determine which character code
> should be mapped to which presentation form. I'm not very familiar
> with the internals of the PDF file. If any of the developers can
> guide me on where to look next, I'd hopefully be able to figure out a
> way to fix this.
> 
> Thanks in advance
> Hamed
>
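
As a possible starting point for the mapping question above, here is a
minimal sketch (not PDFBox code) that uses only the JDK. It relies on
the fact that each code point in the Arabic Presentation Forms-B block
has a compatibility decomposition back to its base letter, so
normalizing with NFKC and grouping by the result lists the candidate
presentation forms for a base character such as U+0647. The class and
method names (PresentationForms, buildTable) are made up for this
illustration. It only lists candidates; deciding which one a given PDF
character code stands for would still need extra information, such as
the neighbouring characters or the glyph data in the font.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration only; not part of PDFBox.
final class PresentationForms {

    /**
     * Maps each base Arabic code point to the Arabic Presentation
     * Forms-B code points whose NFKC normalization is that base letter.
     */
    static Map<Integer, List<Integer>> buildTable() {
        Map<Integer, List<Integer>> table = new TreeMap<>();
        for (int cp = 0xFE70; cp <= 0xFEFC; cp++) {
            if (!Character.isDefined(cp)) {
                continue; // unassigned code point in the block
            }
            String form = new String(Character.toChars(cp));
            String base = Normalizer.normalize(form, Normalizer.Form.NFKC);
            // Skip forms that expand to several characters (for example
            // the lam-alef ligatures) and anything that maps to itself.
            if (base.codePointCount(0, base.length()) != 1 || base.equals(form)) {
                continue;
            }
            table.computeIfAbsent(base.codePointAt(0), k -> new ArrayList<>()).add(cp);
        }
        return table;
    }

    public static void main(String[] args) {
        // Print the candidate presentation forms for ARABIC LETTER HEH.
        for (int cp : buildTable().get(0x0647)) {
            System.out.println(String.format("U+%04X", cp));
        }
    }
}
```

Running main prints the presentation forms that normalize to U+0647;
U+FEEA from the example above is one of them.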
