Re: Text extraction and wrong or missing ToUnicode maps

Wulf Berschin Thu, 04 Aug 2011 05:14:09 -0700

Well, it turned out that a wrong CMap in the (TrueType) font was the origin of this problem. When copy-pasting text from Adobe Reader, the same artifacts were shown.

In the faulty font, glyph #218 (sacute) was mapped to both U+0153 (oelig) and U+015B (sacute). Fortunately, the PostScript names in the glyf table were present, so I could find the correct Unicode value (via Adobe's glyph list, sacute = U+015B) and apply a correction mapping to PDFTextStripper's text output.


Font CMap: Unicode -> Index
Font Glyf Table: Index -> PSName
Adobe Glyph List: PSName -> Unicode
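
The chain of three lookups above can be sketched roughly as follows. This is only an illustration under assumptions, not the code actually used: the class name GlyphCorrection, the method names, and the idea of holding the three tables as plain Java maps are all hypothetical, and the sample data is limited to the single glyph mentioned in this thread. Reading the tables out of the font and fetching the text from PDFTextStripper are left out.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: derives a "wrong code point -> intended code point" map
// from the three tables and applies it to already extracted text.
final class GlyphCorrection {

    // cmap:       Unicode code point -> glyph index (from the font)
    // glyphNames: glyph index -> PostScript name (from the font)
    // glyphList:  PostScript name -> Unicode code point (Adobe Glyph List)
    static Map<Integer, Integer> buildCorrections(Map<Integer, Integer> cmap,
                                                  Map<Integer, String> glyphNames,
                                                  Map<String, Integer> glyphList) {
        Map<Integer, Integer> corrections = new HashMap<>();
        for (Map.Entry<Integer, Integer> entry : cmap.entrySet()) {
            int codePoint = entry.getKey();
            String name = glyphNames.get(entry.getValue());
            Integer expected = (name == null) ? null : glyphList.get(name);
            // Keep only entries where the cmap disagrees with the glyph name.
            if (expected != null && expected != codePoint) {
                corrections.put(codePoint, expected);
            }
        }
        return corrections;
    }

    // Replaces every code point that has a correction; all others pass through.
    static String correct(String text, Map<Integer, Integer> corrections) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints()
            .forEach(cp -> out.appendCodePoint(corrections.getOrDefault(cp, cp)));
        return out.toString();
    }

    public static void main(String[] args) {
        // Sample data modelled on the faulty font described above (assumed values).
        Map<Integer, Integer> cmap = new HashMap<>();
        cmap.put(0x0153, 218); // wrong: oelig code point points at the sacute glyph
        cmap.put(0x015B, 218); // right: sacute
        Map<Integer, String> glyphNames = new HashMap<>();
        glyphNames.put(218, "sacute");
        Map<String, Integer> glyphList = new HashMap<>();
        glyphList.put("sacute", 0x015B);

        Map<Integer, Integer> corrections = buildCorrections(cmap, glyphNames, glyphList);
        System.out.println(correct("\u0153wiat", corrections));
    }
}
```

Note that such a map is only valid for text set in the faulty font. Applied blindly to a whole document, it would also rewrite a legitimate U+0153 coming from another font.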


Wulf


On 13.07.2011 16:33, Wulf Berschin wrote:

Hi,

When extracting text from a bunch of PDF documents in several languages, I
wondered why some special characters in some documents were wrong in the
extracted text files.

As it turns out, these wrongly decoded PDFs have no or flawed ToUnicode
dictionaries. The fonts are TrueType and always embedded.

Does anybody know

- under what circumstances PDFs with no or incorrect CMaps are created?

- how I could work around this problem?
Since I have the TTFs: could I preload them? Otherwise: could I correct
the PDFs by replacing the wrong CMap or adding a correct one?

Thank you for your help.

Wulf
