[poppler] How to normalize MathematicalPi text?

Jeroen Ooms Wed, 13 Mar 2019 05:55:08 -0700

A researcher who is using the R bindings to analyze large numbers of
scientific papers has asked me for advice on the following:


When extracting results from scientific PDFs, math symbols sometimes
cannot be extracted correctly because they are encoded with a custom
font called Mathematical-Pi [1]. An example of such a paper is [2]. When
we extract text via poppler::page::text(), all of the = < > α β
characters come out as seemingly random characters from Mathematical-Pi
rather than the expected Unicode symbols. Unfortunately these characters
are critical for interpreting the results, so we cannot ignore them.

I was wondering if anyone has experience with normalizing text set in
custom fonts into proper Unicode?

I think what would be needed is a table that maps the Mathematical-Pi
characters to their proper Unicode values. Then we would need some hook
in poppler::page::text() to replace the text boxes that use the
Mathematical-Pi font with the corresponding UTF-8 text.
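
To make the idea concrete, here is a minimal sketch in Python of the
table-plus-hook approach. It assumes the extractor can hand back text as
(text, font name) runs, which poppler::page::text() does not do by
itself. The names normalize_runs and PI_TO_UNICODE are hypothetical, and
the table entries are placeholders only, not a verified Mathematical-Pi
mapping. A real table would have to be built by inspecting the font's
glyphs.

```python
# Placeholder table: illustrative entries only, NOT a verified
# Mathematical-Pi mapping. Build the real one from the font's glyphs.
PI_TO_UNICODE = {
    "5": "=",
    ",": "<",
    ".": ">",
    "a": "\u03b1",  # Greek small letter alpha
    "b": "\u03b2",  # Greek small letter beta
}


def normalize_runs(runs, table=PI_TO_UNICODE, font_marker="MathematicalPi"):
    """Join (text, font_name) runs, remapping only runs set in the pi font.

    `runs` is an assumed input shape: an iterable of (text, font_name)
    tuples in reading order. Font names are compared case-insensitively
    with hyphens removed, so subset-prefixed names such as
    "ABCDEF+MathematicalPi-One" and "Mathematical-Pi" both match.
    """
    marker = font_marker.lower()
    out = []
    for text, font_name in runs:
        if marker in font_name.lower().replace("-", ""):
            # Characters missing from the table pass through unchanged.
            text = "".join(table.get(ch, ch) for ch in text)
        out.append(text)
    return "".join(out)


if __name__ == "__main__":
    sample = [
        ("p ", "Times-Roman"),
        (",", "ABCDEF+MathematicalPi-One"),
        (" .05, n ", "Times-Roman"),
        ("5", "Mathematical-Pi"),
        (" 20", "Times-Roman"),
    ]
    print(normalize_runs(sample))
```

The replacement is keyed on the font of each run rather than applied to
the whole string, so an ordinary comma or digit in the body font is left
untouched.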


 [1] https://files.acrobat.com/a/preview/b445ea2f-fcbb-44af-a798-fc854d8dd9b5
 [2] https://github.com/ropensci/pdftools/files/2961444/Ames2004.pdf
_______________________________________________
poppler mailing list
poppler@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/poppler
