Text extraction and wrong or missing ToUnicode maps

Wulf Berschin Wed, 13 Jul 2011 07:33:38 -0700

Hi,

when extracting text from a bunch of PDF documents in several languages, I wondered why some special characters in some documents were wrong in the extracted text files.

As it turns out, these wrongly decoded PDFs have missing or flawed ToUnicode dictionaries. The fonts are TrueType and always embedded.


Does anybody know

- under what circumstances PDFs with missing or incorrect CMaps are created?

- how I could work around this problem?

Since I have the TTFs, could I preload them? Otherwise, could I correct the PDFs by replacing the wrong CMap or adding a correct one?
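
For illustration only, here is a minimal Python sketch of what a ToUnicode CMap does during extraction. It is not tied to any particular PDF library. The CMap fragment, the character codes and the function names (parse_bfchar, decode) are all made up for this example, and only bfchar entries are handled (bfrange entries are ignored). It shows why a code with a missing entry cannot be turned into the right character:

```python
import re

# Hypothetical ToUnicode CMap fragment for a font with single-byte codes.
SAMPLE_CMAP = """
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
1 begincodespacerange
<00> <FF>
endcodespacerange
3 beginbfchar
<01> <0048>
<02> <00E4>
<03> <20AC>
endbfchar
endcmap
"""

def parse_bfchar(cmap_text):
    """Return {code: unicode string} for all beginbfchar/endbfchar sections."""
    mapping = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        # Each entry is "<source code> <UTF-16BE destination>", both in hex.
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section):
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping

def decode(codes, mapping):
    """Map each single-byte code to text; unmapped codes become U+FFFD."""
    return "".join(mapping.get(code, "\ufffd") for code in codes)

mapping = parse_bfchar(SAMPLE_CMAP)
# Code 0x04 has no entry, so it cannot be decoded correctly.
text = decode(b"\x01\x02\x03\x04", mapping)
print(text)
```

Under these assumptions, correcting a PDF would amount to supplying the missing or wrong entries of such a table for each affected font.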


Thank you for your help.

Wulf
