On 03/11/2017 15:32, Admin eLawJournal wrote:
Hi,
I have read that we can use tesseract with solr to index image files. I
would like some guidance on setting this up.

Currently, I am using solr for searching my wordpress installation via the
WPSOLR plugin.

I have Solr 6.6 installed on ubuntu 14.04 which is working fine with
wordpress.

I have also installed tesseract but have no clue on configuring it.


I am new to solr so will greatly appreciate detailed step-by-step
instructions.

Hi,

I'm guessing if you're using a preconfigured Solr plugin for WP you probably haven't got your hands properly dirty with Solr yet.

One way to use Tesseract would be via Apache Tika https://wiki.apache.org/tika/TikaOCR which is an awesome library for extracting plain text from many different document formats and types. There's a direct way to use Tesseract from within Solr (the ExtractingRequestHandler https://lucene.apache.org/solr/guide/6_6/uploading-data-with-solr-cell-using-apache-tika.html#uploading-data-with-solr-cell-using-apache-tika) but we don't generally recommend this, as dodgy files can sometimes eat all your resources during parsing, and because Tika runs inside Solr in that setup, if Tika dies then so does Solr. We usually process the files externally and then feed the extracted text to Solr using its HTTP API.

Here's one way to do it - a simple server wrapper around Tika https://github.com/mattflax/dropwizard-tika-server written by my colleague Matt Pearce.

So you're going to need to do some coding I think - Python would be a good choice - to feed your source files to Tika for OCR and extraction, and then the resulting text to Solr for indexing.
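
To give a rough idea of what that coding might look like, here's a minimal sketch in Python using only the standard library. It's an illustration under assumptions, not a finished tool: it assumes a stock Apache Tika server listening on its default port 9998 (Matt's wrapper may expose different endpoints, so check its README), with Tesseract installed where Tika can find it. The Solr core name "images" and the field name "content_txt" are made up for the example, so swap in whatever your own schema uses.

```python
import json
import sys
import urllib.request
from pathlib import Path

# Assumptions, not taken from this thread: a stock Apache Tika server on its
# default port, and a Solr core called "images" with "id" and "content_txt"
# fields. Both names are hypothetical; adjust them to your own setup.
TIKA_URL = "http://localhost:9998/tika"
SOLR_UPDATE_URL = "http://localhost:8983/solr/images/update?commit=true"


def tika_request(data: bytes) -> urllib.request.Request:
    """Build a PUT request asking Tika for the plain text of a file."""
    return urllib.request.Request(
        TIKA_URL, data=data, method="PUT", headers={"Accept": "text/plain"}
    )


def solr_request(doc_id: str, text: str) -> urllib.request.Request:
    """Build a POST request that adds one document to Solr as JSON."""
    body = json.dumps([{"id": doc_id, "content_txt": text}]).encode("utf-8")
    return urllib.request.Request(
        SOLR_UPDATE_URL,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def index_file(path: Path) -> None:
    """Extract text from one file via Tika, then send it to Solr."""
    with urllib.request.urlopen(tika_request(path.read_bytes())) as resp:
        text = resp.read().decode("utf-8")
    with urllib.request.urlopen(solr_request(str(path), text)) as resp:
        resp.read()  # Solr replies with a small JSON status document


if __name__ == "__main__":
    # Usage: python index_images.py scan1.png scan2.jpg
    for name in sys.argv[1:]:
        index_file(Path(name))
```

Because the extraction happens in a separate process, a file that makes Tika fall over only breaks that one request and Solr carries on regardless.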

Cheers

Charlie


Thank you very much



--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk
