Hi,
When extracting text from a bunch of PDF documents in several languages, I
wondered why some special characters in some documents were wrong in the
extracted text files.
As it turns out, these incorrectly decoded PDFs have missing or flawed
ToUnicode CMaps. The fonts are TrueType and always embedded.
Does anybody know
- under what circumstances PDFs with missing or incorrect CMaps are created?
- how I could work around this problem?
Since I have the TTFs: could I preload them? Otherwise: could I correct
the PDFs by replacing the wrong CMap or adding a correct one?
Thank you for your help.
Wulf