Completely agree with Tilman.
I've made a substantial start, covering over 100 fonts (mainly from
science/tech/eng/math). See
https://github.com/petermr/cephis/blob/master/src/main/resources/org/contentmine/pdf2svg/codepoints/symbol/advpi1.xml
and many more in
https://github.com/petermr/cephis/blob/master/src/main/resources/org/contentmine/pdf2svg/codepoints/

Here's a typical one - from the Mathematical PI range (apologies for
formatting):

<!-- NOT UNICODE -->
<codePoint unicode="U+2264" decimal="35" name="numbersign" note="LESS-THAN OR EQUAL TO"/>
<codePoint unicode="U+2265" decimal="36" name="dollar" note="GREATER-THAN OR EQUAL TO"/>
<codePoint unicode="U+2245" decimal="38" name="ampersand" note="APPROXIMATELY EQUAL TO"/>
<codePoint unicode="U+003C" decimal="44" name="comma" note="LESS-THAN SIGN"/>

Note how the codepoint and name have no relation to the glyph.
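
As a rough sketch of how such a table might be applied (Python only for
brevity; the <codePointSet> wrapper element and the function names
load_codepoint_map and remap are assumptions made for illustration, not
necessarily what the files in the repository use), one could read the
entries into a lookup from the font's decimal code to the intended Unicode
character:

```python
import xml.etree.ElementTree as ET

# The four entries quoted above, wrapped in a hypothetical root element
# (<codePointSet> is an assumption; the real files may use another root).
SAMPLE = """<codePointSet>
  <codePoint unicode="U+2264" decimal="35" name="numbersign" note="LESS-THAN OR EQUAL TO"/>
  <codePoint unicode="U+2265" decimal="36" name="dollar" note="GREATER-THAN OR EQUAL TO"/>
  <codePoint unicode="U+2245" decimal="38" name="ampersand" note="APPROXIMATELY EQUAL TO"/>
  <codePoint unicode="U+003C" decimal="44" name="comma" note="LESS-THAN SIGN"/>
</codePointSet>"""


def load_codepoint_map(xml_text):
    """Return a dict mapping the font's decimal code to a Unicode character."""
    root = ET.fromstring(xml_text)
    mapping = {}
    for element in root.iter("codePoint"):
        code = int(element.get("decimal"))
        # The unicode attribute looks like "U+2264": drop "U+", parse as hex.
        mapping[code] = chr(int(element.get("unicode")[2:], 16))
    return mapping


def remap(raw_codes, mapping):
    """Translate raw character codes from the PDF into the intended text.

    Codes missing from the table become U+FFFD so gaps remain visible.
    """
    return "".join(mapping.get(code, "\ufffd") for code in raw_codes)


if __name__ == "__main__":
    table = load_codepoint_map(SAMPLE)
    # Codes 35, 36, 44 would naively read as '#', '$', ','.
    print(remap([35, 36, 44], table))
```

Unmapped codes are replaced with U+FFFD here rather than passed through,
so that missing entries show up in the output instead of silently
producing the wrong character.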

Many of these fonts are proprietary and so impossible to obtain.

I'd be happy to hear of others prepared to help with managing these - I've
spent months...



On Wed, Apr 3, 2019 at 5:52 PM Tilman Hausherr <[email protected]>
wrote:

> On 02.04.2019 at 03:59, Tim Allison wrote:
> > Again, short of AI, your best bet is to run OCR (tesseract) on these
> files.
>
>
> Another possible idea: create a huge database of font names, glyph
> paths (or a hash of each path) and Unicode values.
>
> One could create such a database by using "good" PDFs as the source, or
> (more simply) by just getting the original fonts and going through them.
>
> The main problem might be that such a database is possibly huge or too
> slow. But it would bring better results than OCR.
>
> Tilman
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>

-- 
Peter Murray-Rust
Reader Emeritus in Molecular Informatics
Unilever Centre, Dept. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
