Adding per-character OCR to Poppler

Gans, Jason David Thu, 29 Jan 2026 11:29:57 -0800

Hello Poppler project,

I have been working on a way to extract text from PDF files whose embedded 
Unicode values do not match the rendered glyphs. Applying OCR to individual 
characters was suggested on the Poppler mailing list back in 2012 
(https://lists.freedesktop.org/archives/poppler/2012-April/009035.html), but I 
couldn’t find any indication that it was ever implemented and tested.


I have posted an experimental version of Poppler (“Poppler-science”; 
https://github.com/lanl/poppler-science) that has been modified to include a 
multilayer perceptron that decodes the glyphs of symbols commonly used in 
the scientific literature. I would appreciate any feedback from the Poppler 
community and any suggestions for improvements!

Regards,

Jason Gans

Bioscience Division
Los Alamos National Laboratory
