Re: OCR with tika-server

Tyler Palsulich Tue, 30 Sep 2014 13:14:45 -0700

Hi,

Hmm. Could you try adding tesseract to your PATH? How did you install
Tesseract? You should be able to do a straightforward `sudo apt-get install
tesseract-ocr`. After that, the OCR tests should pass. We're still running
into TIKA-1422, where a mail test fails. But, you can run just the OCR
tests with `mvn test -Dtest=org.apache.tika.parser.ocr.TesseractOCRTest
-DfailIfNoTests=false`.
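
As a quick sanity check before rerunning the tests, a shell sketch along these
lines can confirm whether the `tesseract` binary is visible on your PATH. The
`on_path` helper is a made-up name for illustration only; it is not part of
Tika or Tesseract.

```shell
# Hypothetical helper: succeeds if the named command resolves on PATH.
on_path() {
  command -v "$1" >/dev/null 2>&1
}

# Report where tesseract resolves from, or that it cannot be found.
if on_path tesseract; then
  echo "tesseract found at: $(command -v tesseract)"
else
  echo "tesseract is not on PATH"
fi
```

If it reports that tesseract is missing, it is probably worth fixing PATH (or
installing the package) in the same shell you launch Maven from.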


Let me know if that works for you!
Tyler

On Tue, Sep 30, 2014 at 4:00 PM, kevin slote <[email protected]> wrote:

> I am working on Ubuntu 10.04 and I am having some trouble.
> Tesseract is installed correctly, but just doing a clone from the repo and
> installing with maven, I am getting some errors.
>
> This is before I did anything with tesseract installed.
>
> Failed tests:   testPPTXOCR(org.apache.tika.parser.ocr.TesseractOCRTest):
> Check for the image's text.
>   testDOCXOCR(org.apache.tika.parser.ocr.TesseractOCRTest)
>   testPDFOCR(org.apache.tika.parser.ocr.TesseractOCRTest)
>
> Next I hard coded the tesseractPath:
>
> I went into the TesseractOCRConfig.java and hard coded 'tesseractPath.'
> Then all tests passed and it built successfully, but then I went to post
> some TIFFs to the server.
> That didn't work. So I tried adding some System.out.println("hello world")
>  (a little crude I know) inside the unit tests to confirm that tesseract
> was working correctly.  It looks like something happens in the unit test in
> TesseractOCRTest.java
> on the line that says TesseractOCRConfig config = new
> TesseractOCRConfig();. Printing to stdout before works, but I get nothing
> after. That happens before the assumeTrue(canRun(config)); call, so an
> exception is not being raised.
>
> Then once everything is built, ocr does not work.  That was why I figured I
> would ask to see if I missed some sort of configuration step in building
> it.
>
> Thanks a ton.
>
>
>
>
>
> On Tue, Sep 30, 2014 at 2:57 PM, Mattmann, Chris A (3980) <
> [email protected]> wrote:
>
> > Dear Kevin,
> >
> > Sure, it already works :) 1.7-SNAPSHOT.
> >
> > See this wiki page:
> >
> > https://wiki.apache.org/tika/TikaOCR
> >
> > I'd be happy to discuss more.
> >
> > Thanks!
> >
> > Cheers,
> > Chris
> >
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > Chris Mattmann, Ph.D.
> > Chief Architect
> > Instrument Software and Science Data Systems Section (398)
> > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> > Office: 168-519, Mailstop: 168-527
> > Email: [email protected]
> > WWW:  http://sunset.usc.edu/~mattmann/
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > Adjunct Associate Professor, Computer Science Department
> > University of Southern California, Los Angeles, CA 90089 USA
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >
> >
> >
> >
> >
> >
> > -----Original Message-----
> > From: kevin slote <[email protected]>
> > Reply-To: "[email protected]" <[email protected]>
> > Date: Tuesday, September 30, 2014 at 8:52 AM
> > To: "[email protected]" <[email protected]>
> > Subject: OCR with tika-server
> >
> > >Hello all,
> > >
> > >I have been testing out the integration of tika with tesseract.
> > >I was wondering if there is  a way to get tika-server to run with
> > >tesseract's OCR capabilities?
> > >
> > >Best
> > >
> > >Kevin Slote
> >
> >
>
