Re: Fwd: configuring Solr with Tesseract

Admin eLawJournal Mon, 06 Nov 2017 03:18:08 -0800

Hi Charlie,

Thanks for the reply. You're right. I haven't got my hands dirty with Solr
yet. I am not from an IT background and learnt everything I know through
lots of reading online. However, all the documentation on Solr assumes that
the reader has advanced IT knowledge. In fact, it took me a week to learn
to install Solr and configure an index to work with WordPress.

Getting Solr to do OCR appears to be beyond me. And I can't code.

*Would you consider setting this up for me for a fee?*

Ideally this would also come with a step-by-step guide for dummies, in case
I want to upgrade in the future.

I also noticed that Tika 1.14 is capable of OCR by itself. I would be okay
with a setup of Solr using Tika 1.14 to OCR the PDFs, if that is possible.

Best regards,
Anand

On Nov 6, 2017 5:05 PM, "Charlie Hull" <char...@flax.co.uk> wrote:

On 03/11/2017 15:32, Admin eLawJournal wrote:

> Hi,
> I have read that we can use Tesseract with Solr to index image files. I
> would like some guidance on setting this up.
>
> Currently, I am using Solr to search my WordPress installation via the
> WPSOLR plugin.
>
> I have Solr 6.6 installed on Ubuntu 14.04, which is working fine with
> WordPress.
>
> I have also installed Tesseract but have no clue how to configure it.
>
>
> I am new to Solr, so I would greatly appreciate detailed step-by-step
> instructions.
>

Hi,

I'm guessing if you're using a preconfigured Solr plugin for WP you
probably haven't got your hands properly dirty with Solr yet.

One way to use Tesseract would be via Apache Tika
https://wiki.apache.org/tika/TikaOCR which is an awesome library for
extracting plain text from many different document formats and types.
There's a direct way to use Tesseract from within Solr (the
ExtractingRequestHandler
https://lucene.apache.org/solr/guide/6_6/uploading-data-with-solr-cell-using-apache-tika.html#uploading-data-with-solr-cell-using-apache-tika)
but we don't generally recommend this, as dodgy files can sometimes eat all
your resources during parsing, and if Tika dies then so does Solr. We usually
process the files externally and then feed them to Solr using its HTTP API.

Here's one way to do it - a simple server wrapper around Tika
https://github.com/mattflax/dropwizard-tika-server written by my colleague
Matt Pearce.

So you're going to need to do some coding I think - Python would be a good
choice - to feed your source files to Tika for OCR and extraction, and then
the resulting text to Solr for indexing.
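
As a rough illustration of that two-step flow, here is a minimal Python
sketch. It is not Flax's actual tooling. It assumes a stock Apache Tika
server listening on its default port 9998 (the dropwizard-tika-server linked
above may expose a different API), a Solr core named "mycore", and schema
fields called "id" and "content"; all of those names are placeholders.
Whether OCR actually happens depends on how Tika and Tesseract are installed
and configured on the machine running Tika.

```python
import json
import urllib.request
from pathlib import Path

# Assumed endpoints: a stock Apache Tika server on its default port, and a
# Solr core named "mycore" (hypothetical; substitute your own core name).
TIKA_URL = "http://localhost:9998/tika"
SOLR_UPDATE_URL = "http://localhost:8983/solr/mycore/update?commit=true"


def build_tika_request(data: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request that sends raw file bytes to Tika and asks for plain text."""
    return urllib.request.Request(
        url, data=data, method="PUT", headers={"Accept": "text/plain"}
    )


def build_solr_request(
    doc_id: str, text: str, url: str = SOLR_UPDATE_URL
) -> urllib.request.Request:
    """Build a POST request that sends one document to Solr's JSON update handler.

    The field names "id" and "content" are assumptions; they must match
    fields defined in the core's schema.
    """
    body = json.dumps([{"id": doc_id, "content": text}]).encode("utf-8")
    return urllib.request.Request(
        url, data=body, method="POST", headers={"Content-Type": "application/json"}
    )


def index_file(path: Path, timeout: float = 300.0) -> None:
    """Extract text from one file via Tika, then send that text to Solr."""
    tika_req = build_tika_request(path.read_bytes())
    with urllib.request.urlopen(tika_req, timeout=timeout) as resp:
        text = resp.read().decode("utf-8", errors="replace")

    # The file path doubles as the document id here, purely for illustration.
    solr_req = build_solr_request(str(path), text)
    with urllib.request.urlopen(solr_req, timeout=timeout) as resp:
        resp.read()  # urlopen raises HTTPError on a 4xx/5xx response
```

With both services running, calling index_file(Path("scan.pdf")) would
process a single file. Because Tika runs as a separate process, a file that
crashes or stalls the parser does not take Solr down with it, which is the
reason given above for extracting externally.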

Cheers

Charlie

> Thank you very much
>
>

-- 
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk
