Hi Peter,

Thank you very much for the reply. Unfortunately, the images I am dealing with are scanned ones.
I will update the list with my results if I succeed in using the line detection algorithms you mentioned.

Thanks & Regards,
Kishore Babu I
Developer
email: [email protected]
office: 040.66417681
www.envistacorp.com

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Peter Murray-Rust
Sent: Saturday, 13 October, 2012 1:05 AM
To: [email protected]
Subject: Re: extracting text from image using pdfbox

On Fri, Oct 12, 2012 at 2:47 PM, Kishore Babu <[email protected]> wrote:

> Hi All,
>
> Is it possible to extract text from an image (JPEG) using pdfbox or is
> there any open source java code for this?

This is a very difficult problem, and solving it completely requires a large amount of applied artificial intelligence. There are no out-of-the-box answers. However, in limited domains there may be heuristic solutions.

I am doing exactly this for scientific diagrams (and using PDFBox for parts of this) as an Open Source project. The project will go best when:

* there are lots of diagrams relating to the same subject
* the graphics strokes and characters are preserved as PDF primitives (paths and characters)
* the characters are in common simple fonts (e.g. Helvetica)

Thus we now have tools which will extract and interpret chemical structures and scientific diagrams (graphs) with a promising degree of precision.

If the characters are present as bitmaps then it is much harder. OCR works best when:

* the fonts are simple and well-known
* there is clear whitespace between the characters
* the characters are aligned with the page axes and are not distorted
* there is no lossy compression algorithm.

I am going to attempt to decipher images in PDFs using PDFBox to extract the images and then line detection algorithms such as http://en.wikipedia.org/wiki/Canny_edge_detector to find lines and characters. I am optimistic of significant progress but it will be slow and will require heuristics.
The things that make the process harder or impossible are:

* scanned images - the images are often skewed and have variable contrast
* lossy compression such as JPEG. (Look at a JPEG and you will see small satellite pixels around sharp edges, produced by its block-based DCT compression. These make OCR much harder.)

BTW if any other reader is interested in hacking scientific, technical and medical (STM) PDFs using Java code layered on PDFBox and is prepared to put in effort at alpha level, I'd be delighted to hear from you. But it is *alpha* at best - there are a lot of heuristics that change frequently.

--
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
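
For anyone following the thread, the edge-detection step mentioned in the reply can be roughed out in plain Java. This is only a sketch under stated assumptions: it applies a Sobel gradient and a fixed threshold, which is just the gradient stage of a Canny detector (no Gaussian smoothing, non-maximum suppression or hysteresis). The class name SobelSketch, the method names, the plain-average grayscale conversion and the threshold of 200 are all hypothetical choices for illustration. It uses only java.awt.image and makes no PDFBox calls; the input image is assumed to have been extracted already.

```java
import java.awt.image.BufferedImage;

/** Hypothetical helper (not part of PDFBox): Sobel gradient plus threshold. */
class SobelSketch {

    /** Converts an image to a grayscale matrix indexed [y][x], values 0..255. */
    static int[][] toGray(BufferedImage img) {
        int w = img.getWidth(), h = img.getHeight();
        int[][] gray = new int[h][w];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int rgb = img.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF, g = (rgb >> 8) & 0xFF, b = rgb & 0xFF;
                gray[y][x] = (r + g + b) / 3; // plain average: an assumption, not a standard
            }
        }
        return gray;
    }

    /** Marks interior pixels whose Sobel gradient magnitude is >= threshold. */
    static boolean[][] sobelEdges(int[][] g, int threshold) {
        int h = g.length, w = g[0].length;
        boolean[][] edge = new boolean[h][w]; // border pixels stay false
        for (int y = 1; y < h - 1; y++) {
            for (int x = 1; x < w - 1; x++) {
                // Horizontal gradient: right column minus left column.
                int gx = (g[y - 1][x + 1] + 2 * g[y][x + 1] + g[y + 1][x + 1])
                       - (g[y - 1][x - 1] + 2 * g[y][x - 1] + g[y + 1][x - 1]);
                // Vertical gradient: row below minus row above.
                int gy = (g[y + 1][x - 1] + 2 * g[y + 1][x] + g[y + 1][x + 1])
                       - (g[y - 1][x - 1] + 2 * g[y - 1][x] + g[y - 1][x + 1]);
                double magnitude = Math.sqrt((double) gx * gx + (double) gy * gy);
                edge[y][x] = magnitude >= threshold;
            }
        }
        return edge;
    }

    public static void main(String[] args) {
        // Synthetic test page: white background with one black vertical stroke.
        BufferedImage img = new BufferedImage(7, 5, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < 5; y++) {
            for (int x = 0; x < 7; x++) {
                img.setRGB(x, y, x == 3 ? 0x000000 : 0xFFFFFF);
            }
        }
        boolean[][] edges = sobelEdges(toGray(img), 200);
        // Print the edge map, one row per line ('#' = edge pixel).
        for (boolean[] row : edges) {
            StringBuilder sb = new StringBuilder();
            for (boolean e : row) sb.append(e ? '#' : '.');
            System.out.println(sb);
        }
    }
}
```

On a clean synthetic stroke the two sides of the line show up as edges. On skewed scans or JPEG input, the satellite pixels described above would probably also cross a fixed threshold, which is one reason a full Canny pipeline smooths the image first.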

