RE: extracting text from image using pdfbox

Kishore Babu Sun, 14 Oct 2012 22:17:49 -0700

Hi Peter, 
Thank you very much for the reply. Unfortunately, the images I am dealing with 
are scanned ones.

I will update my result if I succeed in using the mentioned line detection 
algorithms. 

Thanks & Regards, 

Kishore Babu I Developer 
email: [email protected]
office: 040.66417681
www.envistacorp.com

-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of Peter Murray-Rust
Sent: Saturday, 13 October, 2012 1:05 AM
To: [email protected]
Subject: Re: extracting text from image using pdfbox

On Fri, Oct 12, 2012 at 2:47 PM, Kishore Babu <[email protected]> wrote:

> Hi All,
>
> Is it possible to extract text from an image (JPEG) using pdfbox or is 
> there any open source java code for this?
>

This is a very difficult problem, and solving it completely requires a large 
amount of applied artificial intelligence. There are no out-of-the-box 
answers.

However, in limited domains there may be heuristic solutions. I am doing exactly 
this for scientific diagrams (and using PDFBox for parts of it) as an Open 
Source project. The project will go best when:
* there are lots of diagrams relating to the same subject
* the graphics strokes and characters are preserved as PDF primitives (paths 
and characters)
* the characters are in common simple fonts (e.g. Helvetica)

We now have tools which will extract and interpret chemical structures and 
scientific diagrams (graphs) with a promising degree of precision.

If the characters are present as bitmaps then it is much harder. OCR works best 
when:
* the fonts are simple and well-known
* there is clear whitespace between the characters
* the characters are aligned with the page axes and are not distorted
* there is no lossy compression algorithm.

I am going to attempt to decipher images in PDFs using PDFBox to extract the 
images and then line detection algorithms such as 
http://en.wikipedia.org/wiki/Canny_edge_detector to find lines and characters. 
I am optimistic about significant progress but it will be slow and will require 
heuristics.
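
As a rough illustration of the edge-detection step, here is a minimal sketch 
of the gradient stage only (a Sobel filter followed by a fixed threshold). It 
is not a full Canny detector: Gaussian smoothing, non-maximum suppression and 
hysteresis are all omitted. It assumes the image has already been obtained as 
a java.awt.image.BufferedImage; extracting it from the PDF with PDFBox is not 
shown. The class name EdgeSketch and the threshold parameter are hypothetical 
and chosen only for this example.

```java
import java.awt.image.BufferedImage;

/** Hypothetical helper (not part of PDFBox): Sobel gradient plus threshold. */
class EdgeSketch {

    /** Converts a packed RGB pixel to a grey level in 0..255. */
    static int grey(int rgb) {
        int r = (rgb >> 16) & 0xFF;
        int g = (rgb >> 8) & 0xFF;
        int b = rgb & 0xFF;
        return (r * 299 + g * 587 + b * 114) / 1000;
    }

    /**
     * Returns a mask indexed as mask[y][x]; true marks pixels whose Sobel
     * gradient magnitude exceeds the threshold. Border pixels stay false.
     */
    static boolean[][] edges(BufferedImage img, int threshold) {
        int w = img.getWidth();
        int h = img.getHeight();
        int[][] g = new int[h][w];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                g[y][x] = grey(img.getRGB(x, y));
            }
        }
        boolean[][] mask = new boolean[h][w];
        for (int y = 1; y < h - 1; y++) {
            for (int x = 1; x < w - 1; x++) {
                // Horizontal gradient: right column minus left column.
                int gx = (g[y - 1][x + 1] + 2 * g[y][x + 1] + g[y + 1][x + 1])
                       - (g[y - 1][x - 1] + 2 * g[y][x - 1] + g[y + 1][x - 1]);
                // Vertical gradient: bottom row minus top row.
                int gy = (g[y + 1][x - 1] + 2 * g[y + 1][x] + g[y + 1][x + 1])
                       - (g[y - 1][x - 1] + 2 * g[y - 1][x] + g[y - 1][x + 1]);
                mask[y][x] = Math.hypot(gx, gy) > threshold;
            }
        }
        return mask;
    }
}
```

A one-pixel-wide dark stroke on a light background is marked on both of its 
sides rather than on the stroke itself, which is one reason a real pipeline 
would add the thinning steps that Canny provides.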

The things that make the process harder or impossible are:
* scanned images - the images are often skewed and have variable contrast
* lossy compression such as JPEG. (Look closely at a JPEG and you will see small 
satellite pixels around sharp edges, produced by its block-based DCT 
quantization. These make OCR much harder.)
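
One common first step against variable contrast in scans is to pick a 
black/white threshold per image rather than hard-coding one. The sketch below 
uses Otsu's method, which chooses the grey level that best separates the 
histogram into two classes. It is offered only as an illustration and is not 
something the project described here is known to use; the class name 
OtsuSketch is hypothetical. It chooses a single global threshold, so it will 
not correct contrast that varies within one page.

```java
/** Hypothetical helper: Otsu's global threshold for grey levels 0..255. */
class OtsuSketch {

    /**
     * Returns the threshold t that maximises the between-class variance.
     * Pixels with level <= t form the dark class, the rest the light class.
     */
    static int threshold(int[] greyLevels) {
        int[] hist = new int[256];
        for (int v : greyLevels) {
            hist[v]++;
        }
        long total = greyLevels.length;
        long sumAll = 0;
        for (int i = 0; i < 256; i++) {
            sumAll += (long) i * hist[i];
        }
        long countLow = 0;
        long sumLow = 0;
        double bestVariance = -1.0;
        int best = 0;
        for (int t = 0; t < 256; t++) {
            countLow += hist[t];
            sumLow += (long) t * hist[t];
            if (countLow == 0) {
                continue;            // dark class still empty
            }
            long countHigh = total - countLow;
            if (countHigh == 0) {
                break;               // light class empty from here on
            }
            double meanLow = (double) sumLow / countLow;
            double meanHigh = (double) (sumAll - sumLow) / countHigh;
            double diff = meanLow - meanHigh;
            double between = (double) countLow * countHigh * diff * diff;
            if (between > bestVariance) {
                bestVariance = between;
                best = t;
            }
        }
        return best;
    }
}
```

The grey levels could come from a conversion like the one used for edge 
detection; every pixel at or below the returned threshold would then be 
treated as ink.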

BTW if any other reader is interested in hacking scientific, technical and 
medical (STM) PDFs using Java code layered on PDFBox, and is prepared to put in 
effort at alpha level, I'd be delighted to hear from you. But it is *alpha* at 
best - there are a lot of heuristics that change frequently.

--
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
