Hi Peter,

PDFBOX-1912 is an effort to add OCR to PDFBox as part of a GSoC engagement.
Maybe that’s what you are looking for?

BR
Maruan

On 22.04.2014 at 15:39, Peter Murray-Rust <[email protected]> wrote:

> We have a need to carry out limited OCR in the PDF extraction process and
> are thinking of adding it to PDF2SVG (
> https://bitbucket.org/petermr/pdf2svg-dev/wiki/Home - our converter based
> on PDFBox). In our work (converting technical documents and scientific
> publications) there are two particular areas:
>
> * when unknown and non-conformant font families are used. This is
> unfortunately extremely common (most scientific publishers use non-Unicode
> undocumented fonts). Our approach is to carry out "OCR" on the glyphs in
> the font maps.
>
> * in binarized image diagrams (e.g. plots), where characters in a (fairly
> small) range of fonts are used (code points mainly in the ASCII range).
>
> There seems to be no pure Java F/OSS OCR software that can be easily used
> with PDFBox and PDF2SVG. We are therefore hacking our own and using bits of
> "javaOCR" (http://sourceforge.net/projects/javaocr/ - which seems to be a
> stalled project) and Longan (https://github.com/Zarkonnen/Longan/tree/master/src -
> the author has recently mailed me and is interested in resuming the work).
> We also have our own approach which involves thinning and topological
> analysis.
>
> This mail is to see if others either have a solution (which would save us
> going further) or to see if anyone is interested in using such a facility.
>
> [Note that this is feasible mainly because the source is born-digital and
> binarized (0/1) and so does not suffer from scanning artefacts such as
> skewing, contrast, noise, etc.]
>
> P.
>
> --
> Peter Murray-Rust
> Reader in Molecular Informatics
> Unilever Centre, Dep. Of Chemistry
> University of Cambridge
> CB2 1EW, UK
> +44-1223-763069

