A researcher who is using the R bindings to analyze large numbers of scientific papers has asked me for advice on the following:
When extracting results from scientific PDFs, math symbols sometimes cannot be extracted because they are encoded in a custom font called Mathematical-Pi [1]. An example of such a paper is [2]. When we extract text via poppler::page::text(), all of the = < > α β characters come out as seemingly random characters from Mathematical-Pi rather than the expected Unicode symbols. Unfortunately, these characters are critical for interpreting the results, so we cannot ignore them.

I was wondering whether anyone has experience with normalizing text set in custom fonts into proper Unicode. I think we would need to construct a table that maps the Mathematical-Pi characters to their proper Unicode values. We would then need some hook in poppler::page::text() to replace text boxes that use the Mathematical-Pi font with the corresponding UTF-8 text.

[1] https://files.acrobat.com/a/preview/b445ea2f-fcbb-44af-a798-fc854d8dd9b5
[2] https://github.com/ropensci/pdftools/files/2961444/Ames2004.pdf

_______________________________________________
poppler mailing list
poppler@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/poppler
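
To make the table-plus-hook idea more concrete, here is a minimal sketch of the remapping step on its own, without any poppler dependency. Everything in it is an assumption made for illustration: `TextRun` is a hypothetical stand-in for an extracted text box with its font name, `normalize_run` and `placeholder_table` are made-up names, the font-name marker is passed in by the caller because the exact embedded font name is not known here, and the table entries are placeholders that have not been checked against the real Mathematical-Pi encoding.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for one extracted text box: the name of the font it
// was set in and the raw code points that text extraction returned.
struct TextRun {
    std::string font_name;
    std::u32string code_points;
};

// Maps a code point as extracted from the custom font to the Unicode code
// point it is meant to represent.
using GlyphTable = std::unordered_map<char32_t, char32_t>;

// PLACEHOLDER entries only: these are NOT verified against the real
// Mathematical-Pi font. A real table would be built by inspecting the font.
GlyphTable placeholder_table() {
    return {
        {U'5', U'='},
        {U'a', U'\u03B1'},  // Greek small letter alpha
        {U'b', U'\u03B2'},  // Greek small letter beta
    };
}

// Encode one Unicode code point as UTF-8 and append it to `out`.
void append_utf8(std::string& out, char32_t cp) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Return the run as UTF-8. Code points are remapped through `table` only if
// the run's font name contains `font_marker`; unmapped code points and runs
// in other fonts pass through unchanged.
std::string normalize_run(const TextRun& run, const GlyphTable& table,
                          const std::string& font_marker) {
    const bool remap = run.font_name.find(font_marker) != std::string::npos;
    std::string out;
    for (char32_t cp : run.code_points) {
        if (remap) {
            auto it = table.find(cp);
            if (it != table.end()) cp = it->second;
        }
        append_utf8(out, cp);
    }
    return out;
}

// Normalize a sequence of runs and concatenate the results.
std::string normalize_runs(const std::vector<TextRun>& runs,
                           const GlyphTable& table,
                           const std::string& font_marker) {
    std::string out;
    for (const TextRun& run : runs) out += normalize_run(run, table, font_marker);
    return out;
}
```

Matching on a substring of the font name rather than the full name is a deliberate choice in this sketch, since embedded fonts in PDFs often carry a subset prefix or a variant suffix. How the font name of each text box would be obtained from poppler is left open here.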