Thanks, could be useful. (Note that it hasn't started yet, and presumably won't until summer.)
We'll probably bash ahead anyway, as we have to do other things, and keep in touch.

On Tue, Apr 22, 2014 at 5:04 PM, Maruan Sahyoun <[email protected]> wrote:

> Hi Peter,
>
> PDFBOX-1912 is an effort to add OCR to PDFBox as part of a GSoC engagement.
>
> Maybe that’s what you are looking for?
>
> BR
> Maruan
>
> On 22.04.2014 at 15:39, Peter Murray-Rust <[email protected]> wrote:
>
> > We have a need to carry out limited OCR in the PDF extraction process and
> > are thinking of adding it to PDF2SVG
> > (https://bitbucket.org/petermr/pdf2svg-dev/wiki/Home - our converter
> > based on PDFBox). In our work (converting technical documents and
> > scientific publications) there are two particular areas:
> >
> > * when unknown and non-conforming font families are used. This is
> >   unfortunately extremely common (most scientific publishers use
> >   non-Unicode undocumented fonts). Our approach is to carry out "OCR" on
> >   the glyphs in the font maps.
> >
> > * in binarized image diagrams (e.g. plots), where characters in a (fairly
> >   small) range of fonts are used (code points mainly in the ASCII range).
> >
> > There seems to be no pure Java F/OSS OCR software that can be easily used
> > with PDFBox and PDF2SVG. We are therefore hacking our own and using bits
> > of "javaOCR" (http://sourceforge.net/projects/javaocr/ - which seems a
> > stalled project) and Longan
> > (https://github.com/Zarkonnen/Longan/tree/master/src - the author has
> > recently mailed me and is interested in resuming the work). We also have
> > our own approach which involves thinning and topological analysis.
> >
> > This mail is to see if others either have a solution (which would save us
> > going further) or to see if anyone is interested in using such a facility.
> >
> > [Note that this is feasible mainly because the source is born-digital and
> > binarized (0/1) and so does not suffer from scanning artefacts such as
> > skewing, contrast, noise, etc.]
> >
> > P.
> >
> > --
> > Peter Murray-Rust
> > Reader in Molecular Informatics
> > Unilever Centre, Dep. Of Chemistry
> > University of Cambridge
> > CB2 1EW, UK
> > +44-1223-763069

--
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
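
For anyone following the thread, here is a rough sketch of what a thinning step on a born-digital 0/1 glyph bitmap could look like. It uses Zhang-Suen thinning, a standard textbook algorithm chosen here purely for illustration; the thread does not say which thinning method PDF2SVG actually uses. The class and method names (GlyphThinning, thin, inkCount) are hypothetical and are not part of PDFBox, PDF2SVG, javaOCR or Longan.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: Zhang-Suen thinning of a binarized (0/1) glyph bitmap. */
class GlyphThinning {

    /** Returns a thinned copy of the bitmap; any non-zero cell counts as ink. */
    static int[][] thin(int[][] bitmap) {
        int h = bitmap.length;
        if (h == 0) {
            return new int[0][0];
        }
        int w = bitmap[0].length;

        // Pad with a one-pixel background border so neighbour lookups stay in bounds.
        int[][] img = new int[h + 2][w + 2];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                img[y + 1][x + 1] = bitmap[y][x] != 0 ? 1 : 0;
            }
        }

        boolean changed = true;
        while (changed) {
            changed = false;
            // Each iteration runs the two Zhang-Suen sub-passes in turn.
            for (int pass = 0; pass < 2; pass++) {
                // Collect first, clear afterwards, so one sub-pass sees a consistent image.
                List<int[]> toClear = new ArrayList<>();
                for (int y = 1; y <= h; y++) {
                    for (int x = 1; x <= w; x++) {
                        if (img[y][x] == 1 && removable(img, y, x, pass)) {
                            toClear.add(new int[] {y, x});
                        }
                    }
                }
                for (int[] p : toClear) {
                    img[p[0]][p[1]] = 0;
                }
                if (!toClear.isEmpty()) {
                    changed = true;
                }
            }
        }

        // Strip the padding again.
        int[][] out = new int[h][w];
        for (int y = 0; y < h; y++) {
            System.arraycopy(img[y + 1], 1, out[y], 0, w);
        }
        return out;
    }

    /** Zhang-Suen deletion test for the ink pixel at (y, x) in the padded image. */
    private static boolean removable(int[][] img, int y, int x, int pass) {
        // Neighbours clockwise, starting from north: N, NE, E, SE, S, SW, W, NW.
        int[] n = {
            img[y - 1][x], img[y - 1][x + 1], img[y][x + 1], img[y + 1][x + 1],
            img[y + 1][x], img[y + 1][x - 1], img[y][x - 1], img[y - 1][x - 1]
        };
        int inkNeighbours = 0;
        int transitions = 0; // number of 0 -> 1 steps going once round the ring
        for (int i = 0; i < 8; i++) {
            inkNeighbours += n[i];
            if (n[i] == 0 && n[(i + 1) % 8] == 1) {
                transitions++;
            }
        }
        if (inkNeighbours < 2 || inkNeighbours > 6 || transitions != 1) {
            return false;
        }
        int north = n[0], east = n[2], south = n[4], west = n[6];
        if (pass == 0) {
            return north * east * south == 0 && east * south * west == 0;
        }
        return north * east * west == 0 && north * south * west == 0;
    }

    /** Counts ink (non-zero) cells; handy for sanity checks. */
    static int inkCount(int[][] bitmap) {
        int count = 0;
        for (int[] row : bitmap) {
            for (int v : row) {
                if (v != 0) {
                    count++;
                }
            }
        }
        return count;
    }
}
```

Calling GlyphThinning.thin(bitmap) on a glyph rendered as an int[][] of 0/1 values returns a new array of the same size in which only a subset of the original ink pixels remains. The topological analysis mentioned in the original mail (for example, counting end points and junctions on the thinned result) would be a separate step and is not sketched here.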

