> One could create such a database by using "good" pdfs as source, or
> (more simply) by just getting the original fonts and going through them.
This has occurred to me, and I'm happy to hear that Peter has been
making headway on this option.
Some questions:
1) Where are the hooks in PDFBox to load an external font gathered via
this method or, better, if sufficient, only its codepoint mappings?
Could we inject just the mappings, or would we need the full font?
Some spaces are inferred from character widths and the distance from
the previous character, so we'd still need the font metrics, right?
2) We can't rely on fonts having unique names across a corpus...right?
How would we pick from multiple options with the same name -- by the
lowest OOV% (out-of-vocabulary rate) of the extracted text?
3) If one happened to have ~500k PDFs available, is there example code
for pulling out codepoint mappings/fonts with PDFBox? Such a study
might give some indication of the feasibility of this approach across
a heterogeneous corpus.
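
To make questions 1) and 2) concrete, here is a rough, PDFBox-independent
sketch in Python of what I have in mind. It reads <codePoint> entries in
the format Peter posted and chooses among same-named candidate tables by
the OOV rate of the decoded text. The <codePoints> wrapper element, the
candidate labels, the toy vocabulary and all function names are my own
assumptions for illustration, not anything that exists in cephis or PDFBox.

```python
import xml.etree.ElementTree as ET

# The four entries quoted in Peter's mail; the <codePoints> root element
# is an assumption added so the fragment parses as a document.
SAMPLE = """<codePoints>
  <codePoint unicode="U+2264" decimal="35" name="numbersign" note="LESS-THAN OR EQUAL TO"/>
  <codePoint unicode="U+2265" decimal="36" name="dollar" note="GREATER-THAN OR EQUAL TO"/>
  <codePoint unicode="U+2245" decimal="38" name="ampersand" note="APPROXIMATELY EQUAL TO"/>
  <codePoint unicode="U+003C" decimal="44" name="comma" note="LESS-THAN SIGN"/>
</codePoints>"""


def load_table(xml_text):
    """Map each font character code (decimal) to its Unicode character."""
    table = {}
    for el in ET.fromstring(xml_text).iter("codePoint"):
        code = int(el.get("decimal"))
        # "U+2264" -> strip the "U+" prefix and parse the rest as hex.
        table[code] = chr(int(el.get("unicode")[2:], 16))
    return table


def decode(codes, table):
    """Decode character codes; unmapped codes become U+FFFD."""
    return "".join(table.get(c, "\ufffd") for c in codes)


def oov_rate(text, vocab):
    """Fraction of whitespace-separated tokens not in vocab (1.0 if none)."""
    tokens = text.lower().split()
    if not tokens:
        return 1.0
    return sum(1 for t in tokens if t not in vocab) / len(tokens)


def pick_table(codes, candidates, vocab):
    """Return the label of the candidate table with the lowest OOV rate."""
    return min(candidates,
               key=lambda label: oov_rate(decode(codes, candidates[label]), vocab))


if __name__ == "__main__":
    print(load_table(SAMPLE))
    # Two hypothetical tables that were both harvested under one font name.
    candidates = {
        "source_a": {65: "c", 66: "a", 67: "t"},
        "source_b": {65: "x", 66: "q", 67: "z"},
    }
    print(pick_table([65, 66, 67], candidates, vocab={"cat"}))
```

A word-list OOV rate is obviously a weak signal for symbol fonts like the
one above, so this would mostly help with text fonts; it also says nothing
about widths, which is why I'm asking whether mappings alone are enough.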
Cheers,
Tim
On Thu, Apr 4, 2019 at 7:58 AM Peter Murray-Rust <[email protected]> wrote:
>
> Completely agree with Tilman
> I've made a large start with over 100 fonts (mainly from
> science/tech/eng/math). See
> https://github.com/petermr/cephis/blob/master/src/main/resources/org/contentmine/pdf2svg/codepoints/symbol/advpi1.xml
> and many more in
> https://github.com/petermr/cephis/blob/master/src/main/resources/org/contentmine/pdf2svg/codepoints/
>
> Here's a typical one - from the Mathematical PI range:9 (apologies for
> formatting)
>
> <!-- NOT UNICODE -->
> <codePoint unicode="U+2264" decimal="35" name="numbersign" note="LESS-THAN OR EQUAL TO"/>
> <codePoint unicode="U+2265" decimal="36" name="dollar" note="GREATER-THAN OR EQUAL TO"/>
> <codePoint unicode="U+2245" decimal="38" name="ampersand" note="APPROXIMATELY EQUAL TO"/>
> <codePoint unicode="U+003C" decimal="44" name="comma" note="LESS-THAN SIGN"/>
> Note how the codepoint and name have no relation to the glyph.
>
> Many of these fonts are proprietary and so impossible to obtain.
>
> I'd be happy to hear of others prepared to help with managing these - I've
> spent months...
>
>
>
> On Wed, Apr 3, 2019 at 5:52 PM Tilman Hausherr <[email protected]>
> wrote:
>
> > On 02.04.2019 at 03:59, Tim Allison wrote:
> > > Again, short of AI, your best bet is to run OCR (tesseract) on these
> > > files.
> >
> >
> > Another possible idea: create a huge database of font names, glyph
> > paths (or hashes of them) and Unicode values.
> >
> > One could create such a database by using "good" pdfs as source, or
> > (more simply) by just getting the original fonts and going through them.
> >
> > The main problem might be that such a database is possibly huge or too
> > slow. But it would bring better results than OCR.
> >
> > Tilman
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [email protected]
> > For additional commands, e-mail: [email protected]
> >
> >
>
> --
> Peter Murray-Rust
> Reader Emeritus in Molecular Informatics
> Unilever Centre, Dept. Of Chemistry
> University of Cambridge
> CB2 1EW, UK
> +44-1223-763069