Re: OCR and PDFBox/PDF2SVG

Peter Murray-Rust Tue, 22 Apr 2014 09:15:14 -0700

Thanks, could be useful. (Note that it hasn't started yet, presumably not until summer.)


We'll probably bash ahead anyway, as we have to do other things, and keep in
touch.

On Tue, Apr 22, 2014 at 5:04 PM, Maruan Sahyoun <[email protected]> wrote:

> Hi Peter,
>
> PDFBOX-1912 is an effort to add OCR to PDFBox as part of a GSoC engagement.
>
> Maybe that’s what you are looking for?
>
> BR
> Maruan
>
> On 22.04.2014 at 15:39, Peter Murray-Rust <[email protected]> wrote:
>
> > We have a need to carry out limited OCR in the PDF extraction process and
> > are thinking of adding it to PDF2SVG
> > (https://bitbucket.org/petermr/pdf2svg-dev/wiki/Home - our converter based
> > on PDFBox). In our work (converting technical documents and scientific
> > publications) there are two particular areas:
> >
> > * when unknown and non-conformant font families are used. This is
> > unfortunately extremely common (most scientific publishers use
> > non-Unicode undocumented fonts). Our approach is to carry out "OCR" on
> > the glyphs in the font maps.
> >
> > * in binarized image diagrams (e.g. plots), where characters in a (fairly
> > small) range of fonts are used (code points mainly in the ASCII range).
> >
> > There seems to be no pure Java F/OSS OCR software that can be easily used
> > with PDFBox and PDF2SVG. We are therefore hacking our own and using bits
> > of "javaOCR" (http://sourceforge.net/projects/javaocr/ - which seems a
> > stalled project) and Longan
> > (https://github.com/Zarkonnen/Longan/tree/master/src - the author has
> > recently mailed me and is interested in resuming the work).
> > We also have our own approach which involves thinning and topological
> > analysis.
> >
> > This mail is to see if others either have a solution (which would save us
> > going further) or to see if anyone is interested in using such a facility.
> >
> > [Note that this is feasible mainly because the source is born-digital and
> > binarized (0/1) and so does not suffer from scanning artefacts such as
> > skewing, contrast, noise, etc.]
> >
> > P.
> >
> > --
> > Peter Murray-Rust
> > Reader in Molecular Informatics
> > Unilever Centre, Dep. Of Chemistry
> > University of Cambridge
> > CB2 1EW, UK
> > +44-1223-763069
>
>


-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
