Completely agree with Tilman. I've made a large start with over 100 fonts (mainly from science/tech/eng/math). See
https://github.com/petermr/cephis/blob/master/src/main/resources/org/contentmine/pdf2svg/codepoints/symbol/advpi1.xml
and many more in
https://github.com/petermr/cephis/blob/master/src/main/resources/org/contentmine/pdf2svg/codepoints/

Here's a typical one, from the Mathematical PI range (apologies for formatting):

    <!-- NOT UNICODE -->
    <codePoint unicode="U+2264" decimal="35" name="numbersign" note="LESS-THAN OR EQUAL TO"/>
    <codePoint unicode="U+2265" decimal="36" name="dollar" note="GREATER-THAN OR EQUAL TO"/>
    <codePoint unicode="U+2245" decimal="38" name="ampersand" note="APPROXIMATELY EQUAL TO"/>
    <codePoint unicode="U+003C" decimal="44" name="comma" note="LESS-THAN SIGN"/>

Note how the codepoint and name have no relation to the glyph: in this font, decimal 35 is named "numbersign" but draws a LESS-THAN OR EQUAL TO sign. Many of these fonts are proprietary and so impossible to obtain.

I'd be happy to hear of others prepared to help with managing these - I've spent months...

On Wed, Apr 3, 2019 at 5:52 PM Tilman Hausherr <[email protected]> wrote:

> On 02.04.2019 at 03:59, Tim Allison wrote:
> > Again, short of AI, your best bet is to run OCR (tesseract) on these
> > files.
>
> Another possible idea: create a huge database of font names, glyph
> paths (or a hash of them) and unicodes.
>
> One could create such a database by using "good" pdfs as source, or
> (more simply) by just getting the original fonts and going through them.
>
> The main problem might be that such a database is possibly huge or too
> slow. But it would bring better results than OCR.
>
> Tilman

--
Peter Murray-Rust
Reader Emeritus in Molecular Informatics
Unilever Centre, Dept. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
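
For anyone wanting to experiment with a table like the advpi1 one above, here is a minimal Python sketch of how such entries could be turned into a lookup from the font's character code to the real Unicode character. It is an illustration only, not how cephis or PDFBox do it: the <codePointSet> wrapper element and the function names load_codepoint_map and remap are assumptions made up for this example; only the four <codePoint> entries come from the message.

```python
import xml.etree.ElementTree as ET

# Entries copied from the message above. The <codePointSet> wrapper is an
# assumption, added only so the fragment parses as a single XML document.
SAMPLE = """<codePointSet>
  <codePoint unicode="U+2264" decimal="35" name="numbersign" note="LESS-THAN OR EQUAL TO"/>
  <codePoint unicode="U+2265" decimal="36" name="dollar" note="GREATER-THAN OR EQUAL TO"/>
  <codePoint unicode="U+2245" decimal="38" name="ampersand" note="APPROXIMATELY EQUAL TO"/>
  <codePoint unicode="U+003C" decimal="44" name="comma" note="LESS-THAN SIGN"/>
</codePointSet>"""


def load_codepoint_map(xml_text):
    """Return {font character code: Unicode character} (hypothetical helper)."""
    mapping = {}
    for cp in ET.fromstring(xml_text).iter("codePoint"):
        code = int(cp.get("decimal"))
        # "U+2264" -> 0x2264 -> the character itself
        mapping[code] = chr(int(cp.get("unicode")[2:], 16))
    return mapping


def remap(raw_codes, mapping):
    """Translate raw character codes; unmapped codes become U+FFFD."""
    return "".join(mapping.get(code, "\ufffd") for code in raw_codes)


ADVPI1 = load_codepoint_map(SAMPLE)
```

With this table, remap([35, 36], ADVPI1) yields the less-than-or-equal and greater-than-or-equal signs rather than "#$". Codes with no entry are replaced by U+FFFD so that gaps in a table stay visible instead of silently passing through the wrong character.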

