Hi,

Hmm. Could you try adding tesseract to your PATH? How did you install Tesseract? You should be able to do a straightforward `sudo apt-get install tesseract-ocr`. After that, the OCR tests should pass.

We're still running into TIKA-1422, where a mail test fails, but you can run just the OCR tests with `mvn test -Dtest=org.apache.tika.parser.ocr.TesseractOCRTest -DfailIfNoTests=false`.
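For the PATH suggestion above, here is a rough, standalone way to check whether the JVM that runs Maven can actually see a `tesseract` binary. It is only a sketch: the class name `TesseractPathCheck` and the method `findOnPath` are made up for illustration and are not part of Tika, and it assumes a Unix-style binary name (on Windows it would be `tesseract.exe`).

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

// Hypothetical helper, not part of Tika: looks for an executable on a PATH-style string.
final class TesseractPathCheck {

    // Returns the first regular, executable file called `name` found in any
    // directory listed in `pathValue` (directories separated by File.pathSeparator).
    static Optional<Path> findOnPath(String name, String pathValue) {
        if (pathValue == null || pathValue.isEmpty()) {
            return Optional.empty();
        }
        for (String dir : pathValue.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue; // skip empty entries such as a trailing separator
            }
            Path candidate = Paths.get(dir, name);
            if (Files.isRegularFile(candidate) && Files.isExecutable(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Uses the PATH of the current process, i.e. what a child JVM would inherit.
        Optional<Path> hit = findOnPath("tesseract", System.getenv("PATH"));
        System.out.println(hit.map(p -> "tesseract found at " + p)
                .orElse("tesseract is not on PATH"));
    }
}
```

If this reports that tesseract is not on PATH from the same shell where you run `mvn`, that would be consistent with the OCR tests failing.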
Let me know if that works for you!

Tyler

On Tue, Sep 30, 2014 at 4:00 PM, kevin slote <kslo...@gmail.com> wrote:

> I am working on Ubuntu 10.4 and I am having some trouble.
> Tesseract is installed correctly, but after just cloning the repo and
> installing with Maven, I am getting some errors.
>
> This is before I did anything with tesseract installed.
>
> Failed tests:
>   testPPTXOCR(org.apache.tika.parser.ocr.TesseractOCRTest): Check for the image's text.
>   testDOCXOCR(org.apache.tika.parser.ocr.TesseractOCRTest)
>   testPDFOCR(org.apache.tika.parser.ocr.TesseractOCRTest)
>
> Next I hard coded the tesseractPath:
>
> I went into TesseractOCRConfig.java and hard coded 'tesseractPath'.
> Then all tests passed and it built successfully, but when I went to post
> some TIFFs to the server, that didn't work. So I tried adding some
> System.out.println("hello world") calls (a little crude, I know) inside
> the unit tests to confirm that tesseract was working correctly. It looks
> like something happens in the unit test in TesseractOCRTest.java on the
> line that says TesseractOCRConfig config = new TesseractOCRConfig();.
> Printing to stdout before that line works, but I get nothing after it.
> That happens before the assumeTrue(canRun(config));, so no exception is
> being raised.
>
> Then, once everything is built, OCR does not work. That is why I figured
> I would ask whether I missed some sort of configuration step in building
> it.
>
> Thanks a ton.
>
> On Tue, Sep 30, 2014 at 2:57 PM, Mattmann, Chris A (3980)
> <chris.a.mattm...@jpl.nasa.gov> wrote:
>
> > Dear Kevin,
> >
> > Sure, it already works :) 1.7-SNAPSHOT.
> >
> > See this wiki page:
> >
> > https://wiki.apache.org/tika/TikaOCR
> >
> > I'd be happy to discuss more.
> >
> > Thanks!
> >
> > Cheers,
> > Chris
> >
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > Chris Mattmann, Ph.D.
> > Chief Architect
> > Instrument Software and Science Data Systems Section (398)
> > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> > Office: 168-519, Mailstop: 168-527
> > Email: chris.a.mattm...@nasa.gov
> > WWW: http://sunset.usc.edu/~mattmann/
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > Adjunct Associate Professor, Computer Science Department
> > University of Southern California, Los Angeles, CA 90089 USA
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >
> > -----Original Message-----
> > From: kevin slote <kslo...@gmail.com>
> > Reply-To: "dev@tika.apache.org" <dev@tika.apache.org>
> > Date: Tuesday, September 30, 2014 at 8:52 AM
> > To: "dev@tika.apache.org" <dev@tika.apache.org>
> > Subject: OCR with tika-server
> >
> > >Hello all,
> > >
> > >I have been testing out the integration of tika with tesseract.
> > >I was wondering if there is a way to get tika-server to run with
> > >tesseract's OCR capabilities?
> > >
> > >Best
> > >
> > >Kevin Slote