RE: extracting text from image using pdfbox

Kishore Babu Sun, 14 Oct 2012 22:59:29 -0700

Thanks, Jeremias. I will try it.

Regards, 

Kishore Babu I Developer 
email: [email protected]
office: 040.66417681
www.envistacorp.com

-----Original Message-----
From: Jeremias Maerki [mailto:[email protected]] 
Sent: Sunday, 14 October, 2012 1:39 PM
To: [email protected]
Subject: Re: extracting text from image using pdfbox

Hi,
Apache PDFBox can't help you here, I'm afraid. What you're after is OCR 
functionality (http://en.wikipedia.org/wiki/Optical_character_recognition)
and PDFBox doesn't provide that. The only thing you can do is to extract the 
bitmap images using PDFBox and then attempt to decipher the text contained in 
them using an external OCR process. Just a warning: don't expect an OCR process 
to be 100% accurate.

If you're looking for an open source OCR engine, Tesseract is probably the most 
popular one: http://en.wikipedia.org/wiki/Tesseract_%28software%29
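
As a rough sketch of the external OCR step described above, the following Java
code calls the Tesseract command-line tool on an image that has already been
saved to disk (for example, a bitmap extracted from the PDF). It assumes the
`tesseract` executable is installed and on the PATH, and it relies on the usual
invocation `tesseract <image> <outputbase> [-l <lang>]`, which writes the
recognized text to `<outputbase>.txt`. The class name `TesseractOcr` and its
methods are made up for illustration; this is not part of PDFBox.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox or Tesseract.
class TesseractOcr {

    // Builds the command line: tesseract <image> <outputBase> [-l <language>]
    public static List<String> buildCommand(String imagePath, String outputBase, String language) {
        List<String> cmd = new ArrayList<>();
        cmd.add("tesseract");
        cmd.add(imagePath);
        cmd.add(outputBase);
        if (language != null && !language.isEmpty()) {
            cmd.add("-l");
            cmd.add(language);
        }
        return cmd;
    }

    // Runs Tesseract on the image and returns the text it wrote to <outputBase>.txt.
    // Assumes the "tesseract" executable can be found on the PATH.
    public static String recognize(File image, String language)
            throws IOException, InterruptedException {
        Path workDir = Files.createTempDirectory("ocr");
        Path outputBase = workDir.resolve("result");

        ProcessBuilder pb = new ProcessBuilder(
                buildCommand(image.getAbsolutePath(), outputBase.toString(), language));
        pb.inheritIO(); // pass Tesseract's own messages through to the console

        int exitCode = pb.start().waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with code " + exitCode);
        }

        Path textFile = workDir.resolve("result.txt");
        return new String(Files.readAllBytes(textFile), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.err.println("usage: java TesseractOcr <image-file> [language]");
            return;
        }
        String language = args.length > 1 ? args[1] : "eng";
        System.out.println(recognize(new File(args[0]), language));
    }
}
```

Run it as `java TesseractOcr page1.jpg eng`, where `page1.jpg` is a placeholder
file name. The quality of the result depends heavily on the resolution and
cleanliness of the image, so the output should be checked rather than trusted.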

HTH
Jeremias Maerki

On 12.10.2012 15:47:40 Kishore Babu wrote:
> Hi All,
> Is it possible to extract text from an image (JPEG) using pdfbox or is there 
> any open source java code for this?
> 
> When I try to convert the PDF to text, the output is blank. I then converted 
> it into a JPEG image. The image shows the text properly, but I am failing to 
> extract it.
> 
> For normal pdf documents I am extracting text nicely using the standard 
> process but when the pdf document is an image, I am failing to extract the 
> text that is present in the image.
> 
> Can anyone give directions on this, please?
> 
> Thanks in advance.
> 
> Regards,
> Kishore Babu I Developer
