Re: Fwd: configuring Solr with Tesseract

Charlie Hull Mon, 06 Nov 2017 01:06:07 -0800

On 03/11/2017 15:32, Admin eLawJournal wrote:

Hi,
I have read that we can use tesseract with solr to index image files. I
would like some guidance on setting this up.


Currently, I am using solr for searching my wordpress installation via the
WPSOLR plugin.

I have Solr 6.6 installed on ubuntu 14.04 which is working fine with
wordpress.

I have also installed tesseract but have no clue on configuring it.


I am new to solr so will greatly appreciate a detailed step by step
instruction.

Hi,

I'm guessing if you're using a preconfigured Solr plugin for WP you probably haven't got your hands properly dirty with Solr yet.

One way to use Tesseract would be via Apache Tika https://wiki.apache.org/tika/TikaOCR which is an awesome library for extracting plain text from many different document formats and types. There's a direct way to use Tesseract from within Solr (the ExtractingRequestHandler https://lucene.apache.org/solr/guide/6_6/uploading-data-with-solr-cell-using-apache-tika.html#uploading-data-with-solr-cell-using-apache-tika) but we don't generally recommend this, as dodgy files can sometimes eat all your resources during parsing and if Tika dies then so does Solr. We usually process the files externally and then feed them to Solr using its HTTP API.

Here's one way to do it - a simple server wrapper around Tika https://github.com/mattflax/dropwizard-tika-server written by my colleague Matt Pearce.

So you're going to need to do some coding I think - Python would be a good choice - to feed your source files to Tika for OCR and extraction, and then the resulting text to Solr for indexing.
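
To give a rough idea of what that Python glue might look like, here is a minimal sketch, not a finished tool. It assumes a stock Apache Tika server listening on localhost:9998 with Tesseract installed on the same machine (the wrapper linked above may expose different endpoints), a Solr core called "mycore" (a made-up name), and a text field called "content" in your schema (also an assumption; use whatever field your schema defines). The function names are purely illustrative.

```python
import json
import urllib.request
from pathlib import Path

# Assumed endpoints; change these to match your own installation.
TIKA_URL = "http://localhost:9998/tika"  # default port of the stock Tika server
SOLR_UPDATE_URL = "http://localhost:8983/solr/mycore/update?commit=true"  # "mycore" is hypothetical


def build_tika_request(data, tika_url=TIKA_URL):
    """Build a PUT request asking Tika to return plain text for the raw file bytes."""
    return urllib.request.Request(
        tika_url,
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def build_solr_request(doc_id, text, solr_url=SOLR_UPDATE_URL):
    """Build a POST request that sends one document to Solr's JSON update handler."""
    # "id" is the usual unique key; "content" is an assumed field name.
    body = json.dumps([{"id": doc_id, "content": text}]).encode("utf-8")
    return urllib.request.Request(
        solr_url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def index_file(path):
    """Extract text from one file via Tika, then index that text in Solr."""
    data = Path(path).read_bytes()
    # OCR can be slow, so allow Tika a generous timeout.
    with urllib.request.urlopen(build_tika_request(data), timeout=300) as resp:
        text = resp.read().decode("utf-8", errors="replace")
    with urllib.request.urlopen(build_solr_request(str(path), text), timeout=60) as resp:
        resp.read()  # urlopen raises on HTTP error statuses
```

Calling index_file("scan.png") on each of your images would then do the extraction and the indexing as two separate HTTP calls. Because Tika runs in its own process, a file that makes it fall over doesn't take Solr down with it.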


Cheers

Charlie


Thank you very much



--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk
