Re: [poppler] How to normalize MathematicalPi text?

2019-03-13 Thread Jason Crain
On Wed, Mar 13, 2019 at 01:54:26PM +0100, Jeroen Ooms wrote:
> I think what would be needed is to construct a table that maps the
> Mathematical-Pi characters into their proper unicode values.

The PDF creator should be providing that table, called the ToUnicode
map, in the font's data structures. Since this font doesn't provide one,
poppler has to guess what the Unicode value could be and it guesses
wrong.

If you were to provide a map that says, for this font, character code
"^A" maps to "β", that should work.
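
As a rough illustration of what such a per-font map could look like
outside of poppler, here is a minimal Python sketch. The font name
"MathematicalPi-One", the FONT_MAPS table and the decode_codes helper
are assumptions made up for this example; only the "^A" (0x01) to "β"
entry comes from the message above, and a real table would have to be
built by inspecting the font.

```python
# Hypothetical per-font table: character code -> Unicode text.
# Only the 0x01 ("^A") -> "β" entry is taken from the thread; the
# font name used as the key is an assumption for illustration.
FONT_MAPS = {
    "MathematicalPi-One": {0x01: "β"},
}

REPLACEMENT = "\ufffd"  # marks codes that have no known mapping


def decode_codes(font_name, codes):
    """Translate raw character codes using the table for font_name.

    Codes without an entry (or fonts without a table) yield U+FFFD so
    that gaps in the table stay visible instead of being guessed.
    """
    table = FONT_MAPS.get(font_name, {})
    return "".join(table.get(code, REPLACEMENT) for code in codes)
```

Emitting U+FFFD for unmapped codes is a design choice for this sketch:
it keeps missing entries easy to spot while the table is being filled in.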
___
poppler mailing list
poppler@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/poppler

[poppler] How to normalize MathematicalPi text?

2019-03-13 Thread Jeroen Ooms
A researcher who is using the R bindings to analyze large numbers of
scientific papers has asked me for advice on the following:

When extracting results from scientific PDFs, math symbols sometimes
cannot be extracted because they are encoded with a custom font
called Mathematical-Pi [1]. An example of such a paper is [2]. When we
extract text via poppler::page::text(), all of the = < > α β characters
come out as seemingly random characters from Mathematical-Pi rather
than the expected Unicode symbols. Unfortunately these characters are
critical for interpreting the results, so we cannot ignore this.

I was wondering if someone has experience with normalizing text set in
custom fonts into proper Unicode?

I think what would be needed is to construct a table that maps the
Mathematical-Pi characters into their proper unicode values. Then we
would need some hook in poppler::page::text() to replace the text of
textboxes that use the Mathematical-Pi font with the corresponding
UTF-8 text.
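
To make the idea concrete, here is a small Python sketch of such a
replacement step applied after extraction. It assumes the extraction
layer can report a font name for each text box, which is not something
this message confirms; TextBox, PI_FIXES and normalize_boxes are
hypothetical names, and the single table entry (the control character
0x01 standing in for "β") is only an example value.

```python
import re
from dataclasses import dataclass

# Hypothetical table from the characters that come out of extraction
# to the symbols that were intended. The single entry is an example.
PI_FIXES = str.maketrans({"\x01": "β"})

# Subset fonts embedded in PDFs carry a six-letter tag such as
# "ABCDEF+" in front of the real font name; strip it before matching.
SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")


@dataclass
class TextBox:
    """Hypothetical stand-in for one extracted run of text."""
    text: str
    font_name: str


def normalize_boxes(boxes):
    """Join the text of all boxes, remapping only Mathematical-Pi runs."""
    parts = []
    for box in boxes:
        base_name = SUBSET_TAG.sub("", box.font_name)
        if base_name.startswith("MathematicalPi"):
            parts.append(box.text.translate(PI_FIXES))
        else:
            parts.append(box.text)
    return "".join(parts)
```

For example, a run "p " in a Times font, followed by "\x01" in
"GHIJKL+MathematicalPi-One" and " 0.05" in Times again, would be
joined as "p β 0.05", while text in other fonts is left untouched.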


 [1] https://files.acrobat.com/a/preview/b445ea2f-fcbb-44af-a798-fc854d8dd9b5
 [2] https://github.com/ropensci/pdftools/files/2961444/Ames2004.pdf