I agree with Tim's analysis.

Many "legacy" fonts (including, unfortunately, some of those used by
LaTeX) are not mapped onto Unicode. There are two indications (codepoints
and names) which can often be used to create a partial mapping. I spent a
*lot* of time doing this manually. For example
>>>
WARN  No Unicode mapping for .notdef (89) in font null

 WARN  No Unicode mapping for 90 (90) in font null
<<<
The first field is the name, the second the codepoint. In your example the
font (probably) uses codepoints consistently within that particular font,
e.g. 89 is consistently the same character and different from 90. The names
*may* differentiate characters. Here is my (hand-edited) entry for CMSY
(used by LaTeX for symbols):

<codePoint unicode="U+00B1" name=".notdef" note="PLUS-MINUS SIGN"/>

But this will only work for this particular font.
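As a rough illustration, entries like the one above can be read into a
per-font lookup table and used to recover a Unicode character from a
glyph name. This is only a sketch: the surrounding file format of a full
table, and the helper names here, are my own assumptions, not part of
any particular tool.

```python
import xml.etree.ElementTree as ET

# One entry in the style of the example above (an assumption about the
# wider file format; only the <codePoint> element shape is taken from it).
ENTRY = '<codePoint unicode="U+00B1" name=".notdef" note="PLUS-MINUS SIGN"/>'

def parse_entry(xml_text):
    """Return (glyph name, Unicode character) for one <codePoint> entry."""
    e = ET.fromstring(xml_text)
    # Convert the "U+00B1" notation into the actual character.
    char = chr(int(e.get("unicode").replace("U+", ""), 16))
    return e.get("name"), char

name, char = parse_entry(ENTRY)
print(name, char)  # .notdef ±
```

A real table would hold many such entries per font, keyed by glyph name
and/or codepoint, and would only be valid for the one font it was built
for, as noted above.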

If you are only dealing with anglophone alphanumerics from a single
source/font you can probably work out a table. You are welcome to use mine
(mainly from scientific/technical publishing). Beyond that, OCR/Tesseract
may help (I use it a lot). However, maths and non-ISO-LATIN text are
problematic. For example, distinguishing between the many types of
dash/minus/underline depends on having a system trained on them. Relative
heights and sizes are a major problem.

In general, typesetters and their software are only concerned with the
visual display and frequently use illiteracies (e.g. "=" + backspace + "/"
for "not-equals"). Anyone having work typeset in PDF should insist that a
Unicode font is used. Better still, avoid PDF.



-- 
Peter Murray-Rust
Reader Emeritus in Molecular Informatics
Unilever Centre, Dept. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
