Well, it turned out that a wrong CMap in the (TrueType) font was the origin of this problem. When copying and pasting text from Adobe Reader, the same artifacts were shown. In the faulty font, glyph #218 (sacute) was mapped to both U+0153 (oelig) and U+015B (sacute). Fortunately, the PostScript glyph names in the font's post table were present, so I could find out the correct Unicode value (via Adobe's glyph list, sacute = U+015B) and apply a correction mapping to PDFTextStripper's text output.

Font CMap: Unicode -> Index
Font post table: Index -> PSName
Adobe Glyph List: PSName -> Unicode
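
The chain above can be sketched in a few lines. This is only a minimal sketch in Python, run on text that PDFTextStripper has already written out. The three dictionaries are hypothetical stand-ins that hold just the one glyph discussed here; in practice they would be read from the font and from the Adobe Glyph List. The function names are made up for illustration as well.

```python
# Hypothetical sample data for the single faulty glyph described above.
# In practice these tables would be read from the font and the glyph list.
FONT_CMAP = {0x0153: 218, 0x015B: 218}   # Unicode -> glyph index (both hit #218)
GLYPH_NAMES = {218: "sacute"}             # glyph index -> PostScript name
ADOBE_GLYPH_LIST = {"sacute": 0x015B, "oe": 0x0153}  # PostScript name -> Unicode


def build_correction(cmap, glyph_names, glyph_list):
    """Return {wrong code point: right code point} for every cmap entry
    that disagrees with the glyph's PostScript name."""
    correction = {}
    for code_point, index in cmap.items():
        name = glyph_names.get(index)
        expected = glyph_list.get(name)
        # Skip glyphs without a usable name and entries that are already right.
        if expected is not None and expected != code_point:
            correction[code_point] = expected
    return correction


def fix_text(text, correction):
    """Apply the correction mapping to extracted text."""
    return text.translate(correction)


CORRECTION = build_correction(FONT_CMAP, GLYPH_NAMES, ADOBE_GLYPH_LIST)
```

Calling fix_text(extracted_text, CORRECTION) then replaces every U+0153 with U+015B. Note that this would also change a genuine oelig, so a mapping like this is probably only safe on text that is set in the faulty font.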

Wulf


On 13.07.2011 at 16:33, Wulf Berschin wrote:
Hi,

When extracting a bunch of PDF documents in several languages, I wondered
why some special characters in some documents were wrong in the
extracted text files.

As it turns out, these wrongly decoded PDFs have missing or flawed
ToUnicode CMaps. The fonts are TrueType and always embedded.

Does anybody know

- under what circumstances PDFs with missing or incorrect CMaps are created?

- how I could work around this problem?
Since I have the TTFs: could I preload them? Otherwise: could I correct
the PDFs by replacing the wrong CMap or adding a correct one?

Thank you for your help.

Wulf




