I agree with Tim's analysis. Many "legacy" fonts (including, unfortunately, some of those used by LaTeX) are not mapped onto Unicode. There are two indicators (codepoints and names) which can often be used to create a partial mapping. I spent a *lot* of time doing this manually. For example:

>>> WARN No Unicode mapping for .notdef (89) in font null
WARN No Unicode mapping for 90 (90) in font null <<<

The first field is the name, the second the codepoint. In your example the font (probably) uses codepoints consistently within that particular font, e.g. 89 is consistently the same character and different from 90. The names *may* differentiate characters. Here is my (hand-edited) entry for CMSY (used by LaTeX for symbols):

<codePoint unicode="U+00B1" name=".notdef" note="PLUS-MINUS SIGN"/>

But this will only work for this particular font. If you are only dealing with anglophone alphanumeric text from a single source/font you can probably work out a table. You are welcome to use mine (mainly from scientific/technical publishing).

Beyond that, OCR/Tesseract may help (I use it a lot). However, maths and non-ISO-LATIN text are problematic. For example, distinguishing between the many types of dash/minus/underline depends on having a system trained on these. Relative heights and sizes are a major problem.

In general, typesetters and their software are only concerned with the visual display and frequently use illiteracies (e.g. "=" + backspace + "/" for "not-equals"). Anyone having work typeset in PDF should insist that a Unicode font is used. Better still, avoid PDF.

-- 
Peter Murray-Rust
Reader Emeritus in Molecular Informatics
Unilever Centre, Dept. of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
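The per-font table approach described above can be sketched roughly as follows in Python. This is only an illustration of the idea: the glyph slots in the sample table are assumptions for a CMSY-style font, not a verified encoding, and you would need to build (and hand-check) your own table for each font you encounter.

```python
# Sketch: apply a hand-built, per-font codepoint-to-Unicode table.
# The slots below are illustrative guesses, not a verified CMSY map.
CMSY_PARTIAL = {
    0: "\u2212",  # MINUS SIGN (assumed slot)
    6: "\u00B1",  # PLUS-MINUS SIGN (as in the hand-edited entry above)
}

def map_glyphs(codes, table):
    """Map raw per-font glyph codes to Unicode text.

    Unmapped codes are flagged as "[?NN]" rather than silently
    guessed, so gaps in the table stay visible for hand-editing.
    """
    return "".join(table.get(c, "[?%d]" % c) for c in codes)

print(map_glyphs([6, 0], CMSY_PARTIAL))  # the two mapped slots
print(map_glyphs([89], CMSY_PARTIAL))    # prints "[?89]" (unmapped)
```

Because each font reuses the same codepoints for different characters, the table has to be keyed per font, which is why a single global mapping cannot work.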

