Hamed,

Nice effort. Thanks for sharing this information. I hope you will be able to overcome this and share your solution.
Best regards,
Hesham

---------------------------------------------
Included message:

> Hi,
>
> I'm trying to resolve PDFBOX-1216, which I reported a while ago, by
> debugging the PDFBox source code, and I need some advice on what to
> do. In brief, the issue is that PDFBox doesn't use presentation forms
> when creating images of Arabic / Persian text in a PDF, so the
> characters are shown disconnected. I'm not sure yet, but I guess this
> is called "ligature"?
>
> Anyway, here's what I have concluded so far, and if anyone could guide
> me, I may be able to fix this and provide a patch.
>
> * In the PDF file, different codes are used for different presentation
> forms of a single Unicode character (in the content stream of the PDF
> file, under the "TJ" command, which is "show text, allowing individual
> glyph positioning").
>
> * In the "ToUnicode" table of the PDF file (which is read into the
> "cmap" variable of the PDFont class), all the presentation forms are
> mapped to the same Unicode character (which is not in the presentation
> range).
>
> * When PDFBox draws text on the graphics canvas, it uses the Unicode
> value in a string and calls the "PDSimpleFont.drawString" method.
>
> * Since the single character is isolated, it is either not found in
> the font, or the isolated form (if present) is rendered.
>
> Example:
>
> You can check the characters at the following address:
> http://en.wikipedia.org/wiki/Arabic_characters_in_Unicode
>
> When there is a U+0647 character in the file ( ه ) and it should be
> connected to the character before it, it should appear as U+FEEA ( ﻪ ).
> In the attached PDF file, this character appears in two different
> fonts. The internal PDF codes for this character in those fonts are
> "00C4" and "03EA".
>
> When I set a breakpoint in the "PDSimpleFont.drawString" method and
> manually replace the string content with the appropriate presentation
> form (like "\ufeea" for the above character), everything else works
> fine and the output image is correct (the presentation form is found
> in the font, whereas the original character, "\u0647", is not embedded
> in the font).
>
> PDF viewers have some way of figuring out the presentation forms,
> because the PDF is displayed correctly in all viewers.
>
> But I could not find out how I can determine which character code
> should be mapped to which presentation form. I'm not very familiar
> with the internals of PDF files. If any of the developers can guide me
> on where to look next, I'd hopefully be able to figure out a way to
> fix this.
>
> Thanks in advance,
> Hamed
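
For reference, the manual replacement described in the quoted message (U+0647 becoming U+FEEA when it joins the preceding letter) can be sketched in plain Java. This is only an illustration, not PDFBox code: the class name and the shapeHeh helper are hypothetical, only the letter HEH is covered, and the joining context is assumed to be known already. It does not answer the open question of how to map a PDF character code to a presentation form. The toBase method shows the reverse direction, which is likely why the ToUnicode table ends up with one base character for all forms.

```java
import java.text.Normalizer;

// Hypothetical illustration class; not part of PDFBox.
final class PresentationFormSketch {
    // ARABIC LETTER HEH and its forms in Arabic Presentation Forms-B.
    static final char HEH = '\u0647';
    static final char HEH_ISOLATED = '\uFEE9';
    static final char HEH_FINAL = '\uFEEA';
    static final char HEH_INITIAL = '\uFEEB';
    static final char HEH_MEDIAL = '\uFEEC';

    // Picks the form of HEH from its joining context (assumed to be known).
    static char shapeHeh(boolean joinsToPrevious, boolean joinsToNext) {
        if (joinsToPrevious && joinsToNext) {
            return HEH_MEDIAL;
        }
        if (joinsToPrevious) {
            return HEH_FINAL;
        }
        if (joinsToNext) {
            return HEH_INITIAL;
        }
        return HEH_ISOLATED;
    }

    // Folds presentation forms back to their base letters using
    // Unicode compatibility normalization (NFKC).
    static String toBase(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // HEH connected only to the letter before it: the final form.
        char shaped = shapeHeh(true, false);
        System.out.printf("U+%04X%n", (int) shaped); // prints U+FEEA

        // All four forms normalize back to the same base character.
        String base = toBase(String.valueOf(shaped));
        System.out.println(base.equals(String.valueOf(HEH))); // prints true
    }
}
```

A real fix would need joining information for every Arabic letter and would have to check that the chosen presentation form is actually embedded in the font, since the message above notes that fonts may contain only some of the forms.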

