On 10/15/2013 12:25 PM, Eric Lease Morgan wrote:
On Oct 14, 2013, at 4:49 PM, Robert Haschart <[email protected]> wrote:
For a limited period of time I am making publicly available a Web-based program
called PDF2TXT -- http://bit.ly/1bJRyh8
Although, based on some subsequent messages where you mention tesseract,
maybe I misunderstood, and your tool only handles PDFs that have already
been OCR'ed, which would explain why the second document (which only
contains page images) fails.
Robert, that's correct. As of right now the document needs to have been
previously OCRed. --Eric
The abstract extraction routine I have been working on does use
tesseract internally for doing OCR when it encounters a document that
doesn't have usable full-text. I agree that tesseract is not that easy
to install, especially if (as in my case) you do not have root/sudo
access to the machine. Since I have gone through installing tesseract
quite recently, perhaps my experience can be helpful to you.
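
To make the OCR fallback mentioned above concrete, here is a rough sketch of
the general idea. It is not the actual abstract extraction routine. It assumes
poppler's pdftotext and pdftoppm and the tesseract command-line program are
on the PATH, and the function names and the 200-character threshold are
hypothetical choices made only for illustration.

```python
import subprocess
import tempfile
from pathlib import Path


def has_usable_text(text, min_chars=200):
    """Heuristic (an assumption): enough non-whitespace characters is 'usable'."""
    return len("".join(text.split())) >= min_chars


def embedded_text(pdf_path):
    """Return the PDF's embedded text layer via poppler's pdftotext."""
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],  # "-" sends the text to stdout
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def ocr_text(pdf_path, dpi=300):
    """Rasterize every page with pdftoppm, then OCR each image with tesseract."""
    pages = []
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        subprocess.run(
            ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(prefix)],
            check=True,
        )
        # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
        for image in sorted(Path(tmp).glob("page-*.png")):
            out_base = image.with_suffix("")  # tesseract appends ".txt" itself
            subprocess.run(
                ["tesseract", str(image), str(out_base)],
                check=True, capture_output=True,
            )
            pages.append(Path(str(out_base) + ".txt").read_text(encoding="utf-8"))
    return "\n".join(pages)


def extract_text(pdf_path, min_chars=200):
    """Prefer the embedded text; fall back to OCR when it is missing or too short."""
    text = embedded_text(pdf_path)
    if has_usable_text(text, min_chars):
        return text
    return ocr_text(pdf_path)
```

Calling extract_text("some-document.pdf") would return the embedded text for a
previously OCRed PDF and only pay the cost of running tesseract for documents
that contain nothing but page images.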
-Bob Haschart