Hi, I'm trying to resolve PDFBOX-1216, which I reported a while ago, by debugging the PDFBox source code, and I need some advice on what to do next. In brief, the issue is that PDFBox doesn't use presentation forms when rendering Arabic / Persian text from a PDF to an image, so the characters are shown disconnected. I'm not sure yet, but I guess this is called a "ligature"?
Anyway, here's what I have concluded so far. If anyone could guide me, I may be able to fix this and provide a patch.

* In the PDF file, different codes are used for the different presentation forms of a single Unicode character (in the content stream of the PDF file, under the "TJ" command, which is "show text, allowing individual glyph positioning").
* In the "ToUnicode" table of the PDF file (which is read into the "cmap" variable of the PDFont class), all the presentation forms are mapped to the same Unicode character (which is not in the presentation range).
* When PDFBox draws text on the graphics canvas, it puts the Unicode value in a string and calls the "PDSimpleFont.drawString" method.
* Since that single character is isolated, it is either not found in the font, or the isolated form (if present) is rendered.

Example: You can check the characters at the following address: http://en.wikipedia.org/wiki/Arabic_characters_in_Unicode

When there is a U+0647 character in the file ( ه ) that should be connected to the character before it, it should appear as U+FEEA ( ﻪ ). In the attached PDF file, this character appears in two different fonts. The internal PDF codes for this character in those fonts are "00C4" and "03EA". When I set a breakpoint in the "PDSimpleFont.drawString" method and manually replace the string content with the appropriate presentation form (like "\ufeea" for the above character), everything else works fine and the output image is correct (the presentation form is found in the font, whereas the original character, "\u0647", is not embedded in it).

PDF viewers must have some way of figuring out the presentation forms, because the PDF is displayed correctly in all viewers. But I could not find out how to determine which character code should be mapped to which presentation form. I'm not very familiar with the internals of the PDF format; if any of the developers can guide me on where to look next, I'd hopefully be able to figure out a way to fix this.

Thanks in advance,
Hamed
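
P.S. To make the manual replacement I did in the debugger concrete, here is a minimal sketch of choosing a presentation form from the neighbouring letters. It is not PDFBox code: the class name, the shape method, and the small letter table are my own assumptions for illustration. It covers only a handful of letters and ignores lam-alef ligatures and diacritics, and it does not answer the real question of how to get from the PDF character codes to the right glyph.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not part of PDFBox): picks Arabic presentation forms by context. */
final class PresentationFormSketch {

    // Each entry: { isolated, final, initial, medial }; 0 means the form does not exist.
    private static final Map<Character, char[]> FORMS = new HashMap<>();
    static {
        FORMS.put('\u0627', new char[] {'\uFE8D', '\uFE8E', 0, 0});               // ALEF, joins only to the previous letter
        FORMS.put('\u0628', new char[] {'\uFE8F', '\uFE90', '\uFE91', '\uFE92'}); // BEH
        FORMS.put('\u0644', new char[] {'\uFEDD', '\uFEDE', '\uFEDF', '\uFEE0'}); // LAM
        FORMS.put('\u0645', new char[] {'\uFEE1', '\uFEE2', '\uFEE3', '\uFEE4'}); // MEEM
        FORMS.put('\u0647', new char[] {'\uFEE9', '\uFEEA', '\uFEEB', '\uFEEC'}); // HEH
    }

    // A letter can connect to the following letter only if it has an initial form.
    private static boolean isDualJoining(char c) {
        char[] forms = FORMS.get(c);
        return forms != null && forms[2] != 0;
    }

    /** Replaces known base letters (in logical order) with their contextual presentation forms. */
    static String shape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            char[] forms = FORMS.get(c);
            if (forms == null) {
                out.append(c); // unknown character: leave untouched
                continue;
            }
            boolean joinsPrev = i > 0 && isDualJoining(text.charAt(i - 1));
            boolean joinsNext = forms[2] != 0
                    && i + 1 < text.length()
                    && FORMS.containsKey(text.charAt(i + 1));
            if (joinsPrev && joinsNext) {
                out.append(forms[3]); // medial
            } else if (joinsPrev) {
                out.append(forms[1]); // final
            } else if (joinsNext) {
                out.append(forms[2]); // initial
            } else {
                out.append(forms[0]); // isolated
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // BEH followed by HEH: HEH should become its final form, U+FEEA.
        StringBuilder hex = new StringBuilder();
        for (char c : shape("\u0628\u0647").toCharArray()) {
            hex.append(String.format("U+%04X ", (int) c));
        }
        System.out.println(hex.toString().trim()); // U+FE91 U+FEEA
    }
}
```

With this, a U+0647 that follows a connecting letter comes out as U+FEEA, which is the same substitution I made by hand at the breakpoint.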

