On 04.04.2019 at 14:59, Tim Allison wrote:
One could create such a database by using "good" PDFs as the source, or
(more simply) by just getting the original fonts and going through them.
This has occurred to me, and I'm happy to hear that Peter has been
making headway on this option.

Some questions:
1) Where are the hooks in PDFBox to load an external font (or better,
if sufficient: the codepoint mappings) gathered via this method?
Would we need the full font, or could we inject only codepoint
mappings (some spaces are calculated by character width/distance from
last character, so we'd need the font info, right)?

I haven't understood all this (but it's late here). I also haven't understood Peter's text.

To load a TTF font in FontBox, call new TTFParser().parse(). From there you can access the glyph paths, the Unicode mappings, etc. For a better understanding, load the font into DTL OTMaster 3.7 Light and look at the tables. FontForge sucks IMHO; its GUI is terrible.

For Type 1 fonts I would have to search...

To get the path of a glyph you encounter in PDFBox, see the DrawPrintTextLocations.java example and search for "cyan". The methods differ for each font type.

With "good" PDFs you don't need to access FontBox directly; you could take the Unicode given by the text stripper together with the glyph path and build a table from that.

But the question is: can/should we use the drawing path as a key, or is there something else that is unique and that would work with a subsetted font? Peter, what is your "key" to get the Unicode?
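
One way to make the drawing path usable as a lookup key is to serialize its segments and hash the result. This is only a minimal sketch using the JDK, not PDFBox code: GlyphPathKey and keyFor are hypothetical names, and rounding coordinates to whole units is an assumption meant to absorb small numeric noise. Whether outlines really survive subsetting unchanged would have to be checked against real files, and glyphs that share an outline (for example Latin A and Greek Alpha in some fonts) would also share a key.

```java
import java.awt.geom.GeneralPath;
import java.awt.geom.PathIterator;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: turns a glyph outline into a stable string key.
class GlyphPathKey {

    /** Serializes the path segments with rounded coordinates and returns a SHA-256 hex digest. */
    public static String keyFor(GeneralPath path) {
        StringBuilder sb = new StringBuilder();
        double[] c = new double[6];
        for (PathIterator it = path.getPathIterator(null); !it.isDone(); it.next()) {
            int type = it.currentSegment(c);
            sb.append(type);
            // Assumption: rounding to whole units hides tiny numeric differences.
            for (int i = 0; i < coordCount(type); i++) {
                sb.append(',').append(Math.round(c[i]));
            }
            sb.append(';');
        }
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(sb.toString().getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be present in every Java runtime.
            throw new IllegalStateException(e);
        }
    }

    /** Number of coordinates that PathIterator fills in for each segment type. */
    public static int coordCount(int type) {
        switch (type) {
            case PathIterator.SEG_MOVETO:
            case PathIterator.SEG_LINETO:
                return 2;
            case PathIterator.SEG_QUADTO:
                return 4;
            case PathIterator.SEG_CUBICTO:
                return 6;
            default:
                return 0; // SEG_CLOSE carries no coordinates
        }
    }
}
```

A table could then be a Map<String, String> from keyFor(path) to the Unicode string that a "good" PDF reports for that glyph.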

Tilman



2) We can't rely on fonts having unique names across a corpus...right?
  How would we pick from multiple options with the same name -- by
  out-of-vocabulary rate (OOV%)?
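
To make the OOV% idea concrete: decode the same run of character codes with each candidate mapping, count how many of the resulting tokens are missing from a word list (the out-of-vocabulary rate), and keep the candidate with the lowest rate. The sketch below is an illustration under assumptions, not existing PDFBox or Tika code: OovPicker and its methods are hypothetical names, tokenization is plain whitespace splitting, and the vocabulary is whatever word list fits the corpus.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: picks the code-to-Unicode mapping that yields the most known words.
class OovPicker {

    /** Fraction of whitespace-separated tokens not found in the vocabulary (0.0 to 1.0). */
    public static double oovRate(String text, Set<String> vocabulary) {
        String[] tokens = text.trim().toLowerCase(Locale.ROOT).split("\\s+");
        if (tokens.length == 0 || tokens[0].isEmpty()) {
            return 1.0; // assumption: empty output counts as fully out of vocabulary
        }
        int unknown = 0;
        for (String token : tokens) {
            if (!vocabulary.contains(token)) {
                unknown++;
            }
        }
        return (double) unknown / tokens.length;
    }

    /** Decodes character codes with one candidate mapping; unmapped codes become U+FFFD. */
    public static String decode(int[] codes, Map<Integer, String> mapping) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.append(mapping.getOrDefault(code, "\uFFFD"));
        }
        return sb.toString();
    }

    /** Returns the name of the candidate whose decoded text has the lowest OOV rate. */
    public static String pickBest(int[] codes,
                                  Map<String, Map<Integer, String>> candidates,
                                  Set<String> vocabulary) {
        String best = null;
        double bestRate = Double.MAX_VALUE;
        for (Map.Entry<String, Map<Integer, String>> e : candidates.entrySet()) {
            double rate = oovRate(decode(codes, e.getValue()), vocabulary);
            if (rate < bestRate) {
                bestRate = rate;
                best = e.getKey();
            }
        }
        return best;
    }
}
```

This only helps for fonts that carry running text; for symbol fonts such as the Mathematical Pi family, a word list says little about which mapping is right.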

3) If one happened to have ~500k PDFs available, is there example code
of how to pull out codepoint mappings/fonts with PDFBox?  This study
might give some indication of feasibility of this approach across a
heterogeneous corpus.

Cheers,

            Tim

On Thu, Apr 4, 2019 at 7:58 AM Peter Murray-Rust wrote:
Completely agree with Tilman.
I've made a large start with over 100 fonts (mainly from
science/tech/eng/math). See
https://github.com/petermr/cephis/blob/master/src/main/resources/org/contentmine/pdf2svg/codepoints/symbol/advpi1.xml
and many more in
https://github.com/petermr/cephis/blob/master/src/main/resources/org/contentmine/pdf2svg/codepoints/

Here's a typical one - from the Mathematical PI range:9 (apologies for
formatting)

<!-- NOT UNICODE -->
<codePoint unicode="U+2264" decimal="35" name="numbersign" note="LESS-THAN OR EQUAL TO"/>
<codePoint unicode="U+2265" decimal="36" name="dollar" note="GREATER-THAN OR EQUAL TO"/>
<codePoint unicode="U+2245" decimal="38" name="ampersand" note="APPROXIMATELY EQUAL TO"/>
<codePoint unicode="U+003C" decimal="44" name="comma" note="LESS-THAN SIGN"/>

Note how the codepoint and name have no relation to the glyph.
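
For anyone wanting to use entries like these programmatically, a small JDK-only sketch of loading them into a lookup table might look as follows. It assumes only the attributes visible in the excerpt (unicode as "U+XXXX", decimal as the font's character code); CodePointTable is a hypothetical name, and the real files may contain other elements or attribute forms that this does not handle.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: reads <codePoint> entries into a code-to-Unicode map.
class CodePointTable {

    /** Maps each entry's decimal character code to the Unicode string named by its unicode attribute. */
    public static Map<Integer, String> parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("codePoint");
            Map<Integer, String> table = new LinkedHashMap<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element e = (Element) nodes.item(i);
                int code = Integer.parseInt(e.getAttribute("decimal"));
                // Assumption: the attribute always has the form "U+" followed by hex digits.
                int unicode = Integer.parseInt(e.getAttribute("unicode").substring(2), 16);
                table.put(code, new String(Character.toChars(unicode)));
            }
            return table;
        } catch (Exception ex) {
            throw new IllegalArgumentException("Could not read codePoint entries", ex);
        }
    }
}
```

With the entries above wrapped in any root element, parse(xml).get(35) would give the LESS-THAN OR EQUAL TO character rather than the number sign the code nominally stands for.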

Many of these fonts are proprietary and so impossible to obtain.

I'd be happy to hear of others prepared to help with managing these - I've
spent months...



On Wed, Apr 3, 2019 at 5:52 PM Tilman Hausherr wrote:

On 02.04.2019 at 03:59, Tim Allison wrote:
Again, short of AI, your best bet is to run OCR (tesseract) on these
files.


Another possible idea: create a huge database of font names, glyph
paths (or a hash of them) and Unicode values.

One could create such a database by using "good" PDFs as the source, or
(more simply) by just getting the original fonts and going through them.

The main problem might be that such a database is possibly huge or too
slow. But it would bring better results than OCR.

Tilman


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]


--
Peter Murray-Rust
Reader Emeritus in Molecular Informatics
Unilever Centre, Dept. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069