Re: [CODE4LIB] pdf2txt [tesseract]

Eric Lease Morgan Thu, 17 Oct 2013 06:44:15 -0700

On Oct 16, 2013, at 10:56 AM, Robert Haschart <[email protected]> wrote:


> The abstract extraction routine I have been working on does use 
> tesseract internally for doing OCR when it encounters a document that 
> doesn't have usable full-text.  I agree that tesseract is not that easy 
> to install, especially if (as in my case) you do not have root/sudo 
> access to the machine.  Since I have gone through installing tesseract 
> quite recently, perhaps my experience can be helpful to you.


Robert, can you outline the process you used to get Tesseract to do OCR against
PDF documents? I installed Tesseract a few months ago, but I couldn't figure
out how to get it to work against PDFs, only some image files. Any pointers would
be greatly appreciated. (Hmmm. Maybe Tesseract doesn't do PDF files, only image
files, and I need to convert my PDFs to images and then feed them to Tesseract.)
--Eric Morgan
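
The guess in the parenthetical above (convert each PDF page to an image, then
hand the images to Tesseract) could be sketched roughly as follows. This is
only an illustration under assumptions, not Robert's actual process: it assumes
the pdftoppm tool from poppler-utils and the tesseract command-line program are
both installed and on the PATH, and the function names, the "page" file prefix,
and the 300 dpi default are made up for the example.

```python
import subprocess
from pathlib import Path


def rasterize_command(pdf_path, prefix, dpi=300):
    """Build the pdftoppm call that writes one PNG per page as <prefix>-N.png."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(prefix)]


def ocr_command(image_path, out_base, lang="eng"):
    """Build the tesseract call; tesseract itself appends .txt to out_base."""
    return ["tesseract", str(image_path), str(out_base), "-l", lang]


def pdf_to_text(pdf_path, workdir):
    """Rasterize a PDF into workdir, OCR each page image, return the joined text.

    Assumption: pdftoppm and tesseract are installed; check=True raises
    CalledProcessError if either command fails.
    """
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    prefix = workdir / "page"  # hypothetical prefix for the page images

    # Step 1: PDF -> one PNG image per page.
    subprocess.run(rasterize_command(pdf_path, prefix), check=True)

    # Step 2: OCR each page image in page order and collect the text.
    pages = []
    for image in sorted(workdir.glob("page-*.png")):
        out_base = image.with_suffix("")  # e.g. page-01.png -> page-01
        subprocess.run(ocr_command(image, out_base), check=True)
        pages.append(Path(str(out_base) + ".txt").read_text(encoding="utf-8"))
    return "\n".join(pages)
```

Calling pdf_to_text("scan.pdf", "ocr-tmp") with a hypothetical scan.pdf would
leave the page images and per-page .txt files in ocr-tmp and return the
combined text. Keeping the command builders separate from the code that runs
them makes it easy to print and inspect the exact commands before running
anything.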
