Re: [CODE4LIB] pdf2txt

Robert Haschart Wed, 16 Oct 2013 07:57:37 -0700

On 10/15/2013 12:25 PM, Eric Lease Morgan wrote:

On Oct 14, 2013, at 4:49 PM, Robert Haschart<[email protected]>  wrote:

For a limited period of time I am making publicly available a Web-based program 
called PDF2TXT -- http://bit.ly/1bJRyh8

Although, based on some subsequent messages where you mention tesseract,
maybe I misunderstood and your tool only handles PDFs that have already
been OCR'ed, which would explain why the second document (which only
contains page images) fails.

Robert, that's correct. As of right now the document needs to have been 
previously OCRed. --Eric

The abstract extraction routine I have been working on does use
tesseract internally for doing OCR when it encounters a document that
doesn't have usable full-text. I agree that tesseract is not that easy
to install, especially if (as in my case) you do not have root/sudo
access to the machine. Since I have gone through installing tesseract
quite recently, perhaps my experience can be helpful to you.
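
To make the "fall back to OCR when there is no usable full-text" idea
concrete, here is a minimal sketch. It is not the actual routine. It
assumes Poppler's pdftotext and pdftoppm and the tesseract command are
all on the PATH. The function names and the 200-character threshold are
made up for illustration.

```python
import subprocess
import tempfile
from pathlib import Path


def has_usable_text(text, min_chars=200):
    """Heuristic (an assumption, not a standard): the text layer counts as
    usable if it holds at least min_chars letters or digits."""
    return sum(ch.isalnum() for ch in text) >= min_chars


def extract_text_layer(pdf_path):
    """Return whatever embedded text the PDF has, via Poppler's pdftotext."""
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],  # "-" sends the text to stdout
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def ocr_pdf(pdf_path, dpi=300):
    """Render each page to a PNG with pdftoppm, then OCR it with tesseract."""
    pages = []
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        subprocess.run(
            ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(prefix)],
            check=True,
        )
        # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
        for image in sorted(Path(tmp).glob("page-*.png")):
            result = subprocess.run(
                ["tesseract", str(image), "stdout"],
                capture_output=True, text=True, check=True,
            )
            pages.append(result.stdout)
    return "\n".join(pages)


def pdf_to_text(pdf_path, extract=extract_text_layer, ocr=ocr_pdf):
    """Prefer the embedded text layer; run OCR only when it looks empty."""
    text = extract(pdf_path)
    if has_usable_text(text):
        return text
    return ocr(pdf_path)
```

Calling pdf_to_text("scan.pdf") on an image-only PDF would take the OCR
branch, while a PDF that has already been OCRed returns its text layer
directly and never invokes tesseract.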


-Bob Haschart
