We need to carry out limited OCR during PDF extraction and are thinking of
adding it to PDF2SVG (https://bitbucket.org/petermr/pdf2svg-dev/wiki/Home),
our converter based on PDFBox. In our work (converting technical documents
and scientific publications) there are two particular areas where OCR is
needed:

* when unknown and non-conforming font families are used. This is
unfortunately extremely common (most scientific publishers use undocumented,
non-Unicode fonts). Our approach is to carry out "OCR" on the glyphs in
the font maps.

* in binarized image diagrams (e.g. plots), where characters in a (fairly
small) range of fonts are used (code points mainly in the ASCII range).

There seems to be no pure-Java F/OSS OCR software that can easily be used
with PDFBox and PDF2SVG. We are therefore hacking our own, using bits of
"javaOCR" (http://sourceforge.net/projects/javaocr/, which seems to be a
stalled project) and Longan (https://github.com/Zarkonnen/Longan/tree/master/src;
the author has recently mailed me and is interested in resuming the work).
We also have our own approach, which involves thinning and topological
analysis.
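
To give a flavour of the thinning step, here is a minimal sketch. It assumes
Zhang-Suen thinning, which is one common algorithm and not necessarily the one
we will end up using. The class name GlyphThinner is hypothetical, and nothing
here is taken from PDF2SVG, javaOCR or Longan. The input is a binary bitmap
(true = ink) which is assumed to have a blank one-pixel margin.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not part of any named project. */
final class GlyphThinner {

    /** Thins a binary image in place (true = ink) and returns the same array. */
    static boolean[][] thin(boolean[][] img) {
        int h = img.length, w = img[0].length;
        boolean changed = true;
        while (changed) {
            changed = false;
            // Zhang-Suen alternates two sub-iterations until nothing changes.
            for (int pass = 0; pass < 2; pass++) {
                List<int[]> toClear = new ArrayList<>();
                // Border pixels are skipped: a blank margin is assumed.
                for (int y = 1; y < h - 1; y++) {
                    for (int x = 1; x < w - 1; x++) {
                        if (img[y][x] && removable(img, y, x, pass)) {
                            toClear.add(new int[] {y, x});
                        }
                    }
                }
                // Deletions are applied together, after the whole scan.
                for (int[] p : toClear) {
                    img[p[0]][p[1]] = false;
                }
                changed |= !toClear.isEmpty();
            }
        }
        return img;
    }

    private static boolean removable(boolean[][] img, int y, int x, int pass) {
        // Eight neighbours, clockwise starting from north (P2..P9).
        boolean[] n = {
            img[y - 1][x], img[y - 1][x + 1], img[y][x + 1], img[y + 1][x + 1],
            img[y + 1][x], img[y + 1][x - 1], img[y][x - 1], img[y - 1][x - 1]
        };
        int count = 0, transitions = 0;
        for (int i = 0; i < 8; i++) {
            if (n[i]) count++;
            if (!n[i] && n[(i + 1) % 8]) transitions++; // background-to-ink step
        }
        if (count < 2 || count > 6 || transitions != 1) return false;
        boolean north = n[0], east = n[2], south = n[4], west = n[6];
        if (pass == 0) {
            return !(north && east && south) && !(east && south && west);
        }
        return !(north && east && west) && !(north && south && west);
    }
}
```

Calling GlyphThinner.thin(bitmap) reduces thick strokes to a skeleton roughly
one pixel wide. Topological analysis (counting end points, junctions and
loops) could then run on that skeleton, but that part is not sketched here.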

This mail is to see whether others either have a solution (which would save
us going further) or are interested in using such a facility.

[Note that this is feasible mainly because the source is born-digital and
binarized (0/1) and so does not suffer from scanning artefacts such as
skewing, contrast, noise, etc.]
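
Because the bitmaps are clean and binary, even a very simple comparison may go
a long way. The following is a minimal sketch, assuming glyphs have already
been rendered to same-sized binary bitmaps, that picks the reference glyph
with the smallest Hamming distance. GlyphMatcher, parse and the '#' notation
are hypothetical names for illustration, not an existing API.

```java
import java.util.Map;

/** Hypothetical nearest-template matcher for same-sized binary glyph bitmaps. */
final class GlyphMatcher {

    /** Number of pixels at which the two bitmaps differ. */
    static int hamming(boolean[][] a, boolean[][] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("bitmap heights differ");
        }
        int diff = 0;
        for (int y = 0; y < a.length; y++) {
            if (a[y].length != b[y].length) {
                throw new IllegalArgumentException("bitmap widths differ");
            }
            for (int x = 0; x < a[y].length; x++) {
                if (a[y][x] != b[y][x]) diff++;
            }
        }
        return diff;
    }

    /** Returns the label of the closest template, or null if there are none. */
    static String classify(boolean[][] glyph, Map<String, boolean[][]> templates) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (Map.Entry<String, boolean[][]> e : templates.entrySet()) {
            int d = hamming(glyph, e.getValue());
            if (d < bestDistance) {
                bestDistance = d;
                best = e.getKey();
            }
        }
        return best;
    }

    /** Builds a bitmap from text rows, where '#' marks an ink pixel. */
    static boolean[][] parse(String... rows) {
        boolean[][] out = new boolean[rows.length][];
        for (int y = 0; y < rows.length; y++) {
            out[y] = new boolean[rows[y].length()];
            for (int x = 0; x < rows[y].length(); x++) {
                out[y][x] = rows[y].charAt(x) == '#';
            }
        }
        return out;
    }
}
```

This would not survive scanned input, where skew and noise change many pixels,
which is why the approach only looks plausible for born-digital sources.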

P.

-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
