We need to carry out limited OCR in the PDF extraction process and are thinking of adding it to PDF2SVG (https://bitbucket.org/petermr/pdf2svg-dev/wiki/Home, our converter based on PDFBox). In our work (converting technical documents and scientific publications) there are two particular areas where this matters:
* when unknown and non-conformant font families are used. This is unfortunately extremely common (most scientific publishers use undocumented, non-Unicode fonts). Our approach is to carry out "OCR" on the glyphs in the font maps.
* in binarized image diagrams (e.g. plots), where characters in a fairly small range of fonts are used (code points mainly in the ASCII range).

There seems to be no pure Java F/OSS OCR software that can be easily used with PDFBox and PDF2SVG. We are therefore hacking our own, using bits of "javaOCR" (http://sourceforge.net/projects/javaocr/, which seems to be a stalled project) and Longan (https://github.com/Zarkonnen/Longan/tree/master/src; the author has recently mailed me and is interested in resuming the work). We also have our own approach, which involves thinning and topological analysis.

This mail is to see whether others already have a solution (which would save us going further) or whether anyone is interested in using such a facility.

[Note that this is feasible mainly because the source is born-digital and binarized (0/1), and so does not suffer from scanning artefacts such as skew, poor contrast, noise, etc.]

P.

--
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
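
For illustration only, here is a minimal sketch of the simplest form that glyph "OCR" on clean binarized input could take: compare an unknown bitmap against a small set of reference templates and pick the closest. This is not the PDF2SVG, javaOCR or Longan code, and it does not do thinning or topological analysis. The class name GlyphMatcher, its methods, and the 3x3 templates are all hypothetical, and it assumes every glyph has already been scaled to the same grid size.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: nearest-template matching on binarized (0/1) glyph bitmaps.
class GlyphMatcher {

    /** Builds a bitmap from text rows: '#' is ink (true), anything else is background. */
    static boolean[][] parse(String... rows) {
        boolean[][] bitmap = new boolean[rows.length][];
        for (int y = 0; y < rows.length; y++) {
            bitmap[y] = new boolean[rows[y].length()];
            for (int x = 0; x < rows[y].length(); x++) {
                bitmap[y][x] = rows[y].charAt(x) == '#';
            }
        }
        return bitmap;
    }

    /** Counts the pixels at which two equally sized bitmaps differ. */
    static int hammingDistance(boolean[][] a, boolean[][] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("height mismatch");
        }
        int distance = 0;
        for (int y = 0; y < a.length; y++) {
            if (a[y].length != b[y].length) {
                throw new IllegalArgumentException("width mismatch in row " + y);
            }
            for (int x = 0; x < a[y].length; x++) {
                if (a[y][x] != b[y][x]) {
                    distance++;
                }
            }
        }
        return distance;
    }

    /** Returns the character whose template differs from the glyph in the fewest pixels. */
    static char classify(boolean[][] glyph, Map<Character, boolean[][]> templates) {
        if (templates.isEmpty()) {
            throw new IllegalArgumentException("no templates");
        }
        char best = 0;
        int bestDistance = Integer.MAX_VALUE;
        for (Map.Entry<Character, boolean[][]> entry : templates.entrySet()) {
            int d = hammingDistance(glyph, entry.getValue());
            if (d < bestDistance) {
                bestDistance = d;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy 3x3 templates, invented for this example.
        Map<Character, boolean[][]> templates = new LinkedHashMap<>();
        templates.put('I', parse(".#.", ".#.", ".#."));
        templates.put('L', parse("#..", "#..", "###"));
        templates.put('T', parse("###", ".#.", ".#."));

        // A 'T' with one stray pixel in the bottom-left corner.
        boolean[][] unknown = parse("###", ".#.", "##.");
        System.out.println(classify(unknown, templates)); // prints T
    }
}
```

Pixel-by-pixel comparison like this is only plausible because the input is born-digital and binarized; it would need size normalization and a rejection threshold before being useful on real font maps.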

