Re: OCR and PDFBox/PDF2SVG

Maruan Sahyoun Tue, 22 Apr 2014 09:05:57 -0700

Hi Peter,

PDFBOX-1912 is an effort to add OCR to PDFBox as part of a GSoC engagement.


Maybe that’s what you are looking for?

BR
Maruan

On 22.04.2014 at 15:39, Peter Murray-Rust <[email protected]> wrote:

> We have a need to carry out limited OCR in the PDF extraction process and
> are thinking of adding it to PDF2SVG (
> https://bitbucket.org/petermr/pdf2svg-dev/wiki/Home - our converter based
> on PDFBox). In our work (converting technical documents and scientific
> publications) there are two particular areas:
> 
> * when unknown and non-conformant font families are used. This is
> unfortunately extremely common (most scientific publishers use non-Unicode
> undocumented fonts). Our approach is to carry out "OCR" on the glyphs in
> the font maps.
> 
> * in binarized image diagrams (e.g. plots), where characters in a (fairly
> small) range of fonts are used (code points mainly in the ASCII range).
> 
> There seems to be no pure Java F/OSS OCR software that can be easily used
> with PDFBox and PDF2SVG. We are therefore hacking our own and using bits of
> "javaOCR" (http://sourceforge.net/projects/javaocr/ - which seems to be a
> stalled project) and Longan (https://github.com/Zarkonnen/Longan/tree/master/src -
> the author has recently mailed me and is interested in resuming the work).
> We also have our own approach which involves thinning and topological
> analysis.
> 
> This mail is to see whether others either have a solution (which would save
> us going further) or are interested in using such a facility.
> 
> [Note that this is feasible mainly because the source is born-digital and
> binarized (0/1) and so does not suffer from scanning artefacts such as
> skewing, contrast, noise, etc.]
> 
> P.
> 
> -- 
> Peter Murray-Rust
> Reader in Molecular Informatics
> Unilever Centre, Dep. Of Chemistry
> University of Cambridge
> CB2 1EW, UK
> +44-1223-763069
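
The quoted message mentions "thinning and topological analysis" of binarized
glyphs but does not say how it is done. Purely as an illustration, and not as
a description of what PDF2SVG actually does, here is a minimal Java sketch of
one simple topological feature: the number of holes in a glyph bitmap (0 for
"C", 1 for "O", 2 for "8"). The class name GlyphTopology, the methods
countHoles and parse, and the '#' = ink convention are all hypothetical names
chosen for this example. It assumes a clean 0/1 bitmap and counts background
regions (4-connected) that cannot be reached from outside the glyph.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper for illustration only; not part of PDFBox or PDF2SVG.
public class GlyphTopology {

    /** Counts 4-connected background regions fully enclosed by ink (holes). */
    public static int countHoles(boolean[][] ink) {
        int h = ink.length, w = ink[0].length;
        // Work on a grid padded by one pixel on each side, so the outer
        // background forms a single region reachable from the corner (0,0).
        boolean[][] seen = new boolean[h + 2][w + 2];
        flood(ink, seen, 0, 0);
        int holes = 0;
        for (int y = 1; y <= h; y++) {
            for (int x = 1; x <= w; x++) {
                // Unvisited background here cannot reach the outside: a hole.
                if (!seen[y][x] && !isInk(ink, y, x)) {
                    holes++;
                    flood(ink, seen, y, x);
                }
            }
        }
        return holes;
    }

    // Padded coordinates (y, x) map to ink[y-1][x-1]; the padding is background.
    private static boolean isInk(boolean[][] ink, int y, int x) {
        int r = y - 1, c = x - 1;
        return r >= 0 && r < ink.length && c >= 0 && c < ink[0].length && ink[r][c];
    }

    // Iterative flood fill over background pixels using 4-connectivity.
    private static void flood(boolean[][] ink, boolean[][] seen, int y0, int x0) {
        int[][] steps = {{1, 0}, {-1, 0}, {0, 1}, {0, -1}};
        Deque<int[]> stack = new ArrayDeque<>();
        stack.push(new int[] {y0, x0});
        seen[y0][x0] = true;
        while (!stack.isEmpty()) {
            int[] p = stack.pop();
            for (int[] s : steps) {
                int y = p[0] + s[0], x = p[1] + s[1];
                if (y < 0 || y >= seen.length || x < 0 || x >= seen[0].length) continue;
                if (seen[y][x] || isInk(ink, y, x)) continue;
                seen[y][x] = true;
                stack.push(new int[] {y, x});
            }
        }
    }

    /** Builds a bitmap from equal-length text rows; '#' is ink, anything else is background. */
    public static boolean[][] parse(String... rows) {
        boolean[][] ink = new boolean[rows.length][rows[0].length()];
        for (int y = 0; y < rows.length; y++) {
            for (int x = 0; x < rows[y].length(); x++) {
                ink[y][x] = rows[y].charAt(x) == '#';
            }
        }
        return ink;
    }
}
```

For example, GlyphTopology.countHoles(GlyphTopology.parse("###", "#.#", "###"))
returns 1. A hole count alone cannot identify a character, so it would only
be one feature among several; it is cheap to compute here because, as the
quoted note says, the source is born-digital and already binarized.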
