Hi,

On Wed, Sep 17, 2008 at 11:38 PM, Brian Carrier <[EMAIL PROTECTED]> wrote:
> My current solution is a two-phased approach:
> 1) PDFStreamEngine.showString(): Has a hard-coded list of fonts that are
> known to use the 3 character form and replaces U+FDF2 with lam-lam-heh. For
> other fonts, U+FDF2 remains.
> 2) PDFTextStripper.flushText(): Looks for the sequence of alef plus the
> U+FDF2 ligature. This should happen only if the U+FDF2 is supposed to be
> replaced by only lam-lam-heh. So, we remove the extra alef. This is a
> fallback for fonts that should have been caught in step 1, but that we do
> not know about.
>
> This works for our test files, but it does not seem like the cleanest
> solution. Ideally, we should be looking at each font and querying it to see
> if it was going to replace the U+FDF2 with lam-lam-heh or alef-lam-lam-heh.
> Anyone know if this is possible in PDFBox?
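
For reference, the step 2 fallback described above seems to boil down to a plain string rewrite. A minimal sketch, assuming the intended output for "alef + U+FDF2" is alef-lam-lam-heh; the class and method names are made up for illustration and are not PDFBox API:

```java
// Hypothetical helper, not part of PDFBox.
final class AllahLigatureFallback {
    private static final String ALEF = "\u0627";
    private static final String LIGATURE = "\uFDF2";
    private static final String LAM_LAM_HEH = "\u0644\u0644\u0647";

    private AllahLigatureFallback() {
    }

    // An alef directly followed by the ligature suggests the font draws only
    // lam-lam-heh for U+FDF2, so expand the ligature to those three letters
    // and keep the single alef. This avoids a doubled alef if the text is
    // later decomposed. (Assumed reading of step 2.)
    static String collapseExtraAlef(String text) {
        return text.replace(ALEF + LIGATURE, ALEF + LAM_LAM_HEH);
    }
}
```

A U+FDF2 that is not preceded by alef is left untouched here, which matches "for other fonts, U+FDF2 remains".
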
The information should be available, since you _are_ getting the
lam-lam-heh sequence out of the font. For example, when a font is loaded
(or first used) we could automatically try to encode the U+FDF2 character
and record the result somewhere.

Instead of modifying PDFStreamEngine and PDFTextStripper, would it make
more sense to encapsulate this logic directly in PDFont.encode()? That way
the logic would live where it belongs, i.e. with the font.

Disclaimer: I'm no PDFBox expert.

BR,

Jukka Zitting
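
A rough sketch of the "probe once per font and record the result" idea, in case it helps. Everything here is hypothetical: the cache class, keying fonts by name, and the probe function, which stands in for whatever font query PDFBox may or may not allow. It is not existing PDFBox code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical per-font cache of what U+FDF2 expands to.
final class LigatureExpansionCache {
    static final String LIGATURE = "\uFDF2";

    private final Map<String, String> expansionByFont = new HashMap<>();
    // Assumed hook: maps a font name to the text that font produces for U+FDF2.
    private final Function<String, String> probe;

    LigatureExpansionCache(Function<String, String> probe) {
        this.probe = probe;
    }

    // Runs the probe the first time a font is seen, then reuses the result.
    String expansionFor(String fontName) {
        return expansionByFont.computeIfAbsent(fontName, probe);
    }

    // Replaces U+FDF2 in the text with the recorded expansion for this font.
    String expand(String fontName, String text) {
        if (!text.contains(LIGATURE)) {
            return text;
        }
        return text.replace(LIGATURE, expansionFor(fontName));
    }
}
```

With something like this behind the font's encode step, neither the hard-coded font list nor the later alef clean-up would be needed, provided the probe can really be implemented.
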
