Re: No Unicode mapping for xx (xx) in font null

Tilman Hausherr Thu, 04 Apr 2019 13:02:17 -0700

Am 04.04.2019 um 14:59 schrieb Tim Allison:

One could create such a database by using "good" PDFs as source, or
(more simply) by just getting the original fonts and going through them.

This has occurred to me, and I'm happy to hear that Peter has been
making headway on this option.


Some questions:
1) Where are the hooks in PDFBox to load an external font (or better,
if sufficient: the codepoint mappings) gathered via this method?
Would we need the full font, or could we inject only codepoint
mappings (some spaces are calculated by character width/distance from
last character, so we'd need the font info, right)?

I haven't understood all this (but it's late here). I also haven't
understood Peter's text.

To get a TTF font in fontbox: new TTFParser().parse(). From there you
can access the path, the unicode, etc. For a better understanding, load
the font into DTL OTMaster 3.7 light and look at the tables. FontForge
sucks IMHO, its GUI is terrible.


For type1 fonts I would have to search...

To get the path of a glyph you hit in PDFBox - see the
DrawPrintTextLocations.java example, search for "cyan". The methods are
different for each font type.

With "good" PDFs you don't need to access fontbox directly, you could
use the unicode given by the stripper together with the path and then
build a table from that.

But the question is, can/should we use the drawing path as a key or is
there something else that is unique and that would work with a subsetted
font? Peter, what is your "key" to get the unicode?
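
A minimal sketch of the "drawing path as a key" idea, using only
java.awt.geom and no PDFBox calls. It assumes the glyph outline is
already available as a GeneralPath in font units, and that rounding
coordinates to whole units is enough to make the same outline hash the
same way in a subsetted copy of the font. Neither assumption has been
checked against real PDFs. GlyphPathKey and its methods are
hypothetical names, not part of PDFBox or fontbox.

```java
import java.awt.Shape;
import java.awt.geom.GeneralPath;
import java.awt.geom.PathIterator;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: turns a glyph outline into a lookup key.
final class GlyphPathKey {

    // Number of coordinates that each PathIterator segment type carries.
    static int coordCount(int type) {
        switch (type) {
            case PathIterator.SEG_MOVETO:
            case PathIterator.SEG_LINETO:
                return 2;
            case PathIterator.SEG_QUADTO:
                return 4;
            case PathIterator.SEG_CUBICTO:
                return 6;
            default:
                return 0; // SEG_CLOSE has no coordinates
        }
    }

    // Hex SHA-256 over the segment types and the rounded coordinates.
    static String of(Shape path) {
        StringBuilder sb = new StringBuilder();
        double[] c = new double[6];
        for (PathIterator it = path.getPathIterator(null); !it.isDone(); it.next()) {
            int type = it.currentSegment(c);
            sb.append(type);
            for (int i = 0; i < coordCount(type); i++) {
                sb.append(',').append(Math.round(c[i])); // assumption: whole font units
            }
            sb.append(';');
        }
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(sb.toString().getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is required on every JVM
        }
    }

    public static void main(String[] args) {
        // Stand-in outline; a real one would come from the font or the PDF.
        GeneralPath triangle = new GeneralPath();
        triangle.moveTo(0.0, 0.0);
        triangle.lineTo(100.0, 0.0);
        triangle.lineTo(50.0, 80.0);
        triangle.closePath();

        // Table built from "good" sources: path key -> Unicode string.
        Map<String, String> table = new HashMap<>();
        table.put(of(triangle), "\u25B3");
        System.out.println(table.get(of(triangle)));
    }
}
```

Whether such a key survives subsetting depends on the subsetter leaving
the outline untouched; if it rescales or re-encodes the curves, the
rounding step would have to be replaced by a stronger normalization.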


Tilman


2) We can't rely on fonts having unique names across a corpus...right?
  How would we pick from multiple options with the same name -- OOV%?

3) If one happened to have ~500k PDFs available, is there example code
of how to pull out codepoint mappings/fonts with PDFBox?  This study
might give some indication of feasibility of this approach across a
heterogeneous corpus.

Cheers,

            Tim

On Thu, Apr 4, 2019 at 7:58 AM Peter Murray-Rust <[email protected]> wrote:

Completely agree with Tilman
I've made a large start with over 100 fonts (mainly from
science/tech/eng/math). See
https://github.com/petermr/cephis/blob/master/src/main/resources/org/contentmine/pdf2svg/codepoints/symbol/advpi1.xml
and many more in
https://github.com/petermr/cephis/blob/master/src/main/resources/org/contentmine/pdf2svg/codepoints/

Here's a typical one - from the Mathematical PI range:9 (apologies for
formatting)

<!-- NOT UNICODE -->
<codePoint unicode="U+2264" decimal="35" name="numbersign" note="LESS-THAN OR EQUAL TO"/>
<codePoint unicode="U+2265" decimal="36" name="dollar" note="GREATER-THAN OR EQUAL TO"/>
<codePoint unicode="U+2245" decimal="38" name="ampersand" note="APPROXIMATELY EQUAL TO"/>
<codePoint unicode="U+003C" decimal="44" name="comma" note="LESS-THAN SIGN"/>
Note how the codepoint and name have no relation to the glyph.
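
As a rough illustration of how entries like these could be turned into
a lookup table, here is a sketch that reads the "decimal" and "unicode"
attributes with the JDK's own XML parser. The class name CodePointTable
and the wrapping root element <codePointSet> are assumptions made for
the example; the actual file layout in the repository may differ.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical reader for <codePoint .../> entries.
final class CodePointTable {

    // Maps the font's character code ("decimal") to the Unicode string.
    static Map<Integer, String> parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("codePoint");
            Map<Integer, String> table = new HashMap<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element e = (Element) nodes.item(i);
                int code = Integer.parseInt(e.getAttribute("decimal"));
                // "U+2264" -> 0x2264: drop the "U+" prefix, parse the rest as hex
                int cp = Integer.parseInt(e.getAttribute("unicode").substring(2), 16);
                table.put(code, new String(Character.toChars(cp)));
            }
            return table;
        } catch (Exception ex) {
            throw new IllegalArgumentException("could not read codePoint entries", ex);
        }
    }

    public static void main(String[] args) {
        // Root element name is an assumption; the entries are from the example above.
        String xml = "<codePointSet>"
                + "<codePoint unicode='U+2264' decimal='35' name='numbersign'/>"
                + "<codePoint unicode='U+003C' decimal='44' name='comma'/>"
                + "</codePointSet>";
        Map<Integer, String> table = parse(xml);
        // Code 35 is named "numbersign" but draws LESS-THAN OR EQUAL TO.
        System.out.println(table.get(35));
    }
}
```

Keying on the character code alone only works within one font, so a
real table would also need some identifier for the font itself.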

Many of these fonts are proprietary and so impossible to obtain.

I'd be happy to hear of others prepared to help with managing these - I've
spent months...



On Wed, Apr 3, 2019 at 5:52 PM Tilman Hausherr <[email protected]>
wrote:

Am 02.04.2019 um 03:59 schrieb Tim Allison:

Again, short of AI, your best bet is to run OCR (tesseract) on these
files.


Another possible idea: create a huge database of font names, glyph
paths (or a hash of them) and unicodes.

One could create such a database by using "good" PDFs as source, or
(more simply) by just getting the original fonts and going through them.

The main problem might be that such a database is possibly huge or too
slow. But it would bring better results than OCR.

Tilman


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

--
Peter Murray-Rust
Reader Emeritus in Molecular Informatics
Unilever Centre, Dept. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069

