Re: [poppler] How to normalize MathematicalPi text?
On Wed, Mar 13, 2019 at 01:54:26PM +0100, Jeroen Ooms wrote:
> I think what would be needed is to construct a table that maps the
> Mathematical-Pi characters into their proper unicode values.

The PDF creator should be providing that table, called the ToUnicode map, in the font's data structures. Since this font doesn't provide one, poppler has to guess what the Unicode value could be and it guesses wrong.

If you were to provide a map that says, for this font, character code "^A" maps to "β", that should work.

___
poppler mailing list
poppler@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/poppler
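For illustration, here is a minimal sketch of what such a per-font map could look like if applied to text after extraction. It is not a poppler API. Only the "^A" to "β" pair comes from the message above; the second entry and the names MATHEMATICAL_PI_MAP and remap are hypothetical placeholders, and the real table would have to be built by inspecting the font.

```python
# Hypothetical per-font table: character as extracted -> intended Unicode.
# Only the "\x01" -> "β" pair is taken from the example in the message;
# the second entry is a made-up placeholder showing the table's shape.
MATHEMATICAL_PI_MAP = {
    "\x01": "\u03b2",  # ^A -> β (example from the message)
    "\x02": "\u03b1",  # hypothetical: ^B -> α
}


def remap(text, table):
    """Replace every character listed in the table; leave the rest alone."""
    return text.translate(str.maketrans(table))


print(remap("\x01 = 0.5", MATHEMATICAL_PI_MAP))
```

Characters that are not in the table pass through unchanged, so a partial table is still safe to apply to text known to come from this font.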
[poppler] How to normalize MathematicalPi text?
A researcher who is using the R bindings to analyze large numbers of scientific papers has asked me for advice on the following:

When extracting results from scientific PDFs, sometimes math symbols cannot be extracted because they are encoded with a custom font called Mathematical-Pi [1]. An example of such a paper is [2]. When we extract text via poppler::page::text(), all of the = < > α β characters come out as seemingly random characters from Mathematical-Pi rather than the expected Unicode symbols. Unfortunately these characters are critical for interpreting the results, so we cannot ignore this.

I was wondering if someone has experience with normalizing text in custom fonts into proper Unicode? I think what would be needed is to construct a table that maps the Mathematical-Pi characters into their proper Unicode values. Then we would need some hook for poppler::page::text() to convert text boxes that use the Mathematical-Pi font into the corresponding UTF-8 text.

[1] https://files.acrobat.com/a/preview/b445ea2f-fcbb-44af-a798-fc854d8dd9b5
[2] https://github.com/ropensci/pdftools/files/2961444/Ames2004.pdf
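To make the idea of a hook concrete, here is a rough sketch of font-selective post-processing. It assumes the extraction layer can report a font name for each text box, which is an assumption here and not something poppler::page::text() returns. TextBox, FONT_TABLES, table_for and normalize are hypothetical names, and the single table entry is illustrative only.

```python
from typing import Dict, List, NamedTuple, Optional


class TextBox(NamedTuple):
    """Hypothetical record: extracted text plus the font it was set in."""
    text: str
    font_name: str


# Hypothetical tables, keyed by a substring of the font name so that
# subset-prefixed names such as "ABCDEF+MathematicalPi-One" also match.
FONT_TABLES: Dict[str, Dict[str, str]] = {
    "MathematicalPi": {"\x01": "\u03b2"},  # illustrative entry only
}


def table_for(font_name: str) -> Optional[Dict[str, str]]:
    """Return the mapping table for a font, or None if it has none."""
    for key, table in FONT_TABLES.items():
        if key in font_name:
            return table
    return None


def normalize(boxes: List[TextBox]) -> str:
    """Remap only boxes set in a known custom font; keep others verbatim."""
    parts = []
    for box in boxes:
        table = table_for(box.font_name)
        if table is None:
            parts.append(box.text)
        else:
            parts.append(box.text.translate(str.maketrans(table)))
    return "".join(parts)


boxes = [
    TextBox("effect ", "Times-Roman"),
    TextBox("\x01", "ABCDEF+MathematicalPi-One"),
    TextBox(" = 0.42", "Times-Roman"),
]
print(normalize(boxes))
```

Keying the substitution on the font matters because the same character code can mean something else in an ordinary text font, so a global search-and-replace over the whole page would corrupt unrelated text.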