Re: Tesseract language

Alexandre Rafalovitch Sun, 21 Oct 2018 07:27:01 -0700

There are a couple of things mixed in here:
1) The extract handler is not recommended for production use. It is great
for a quick test, just like you did, but for production it is better to run
Tika externally. Tika, especially with large files, can use up a lot of
memory and trip up the Solr instance it is running within.
2) If you are still just testing, you can configure Tika within Solr by
specifying a parseContext.config file, as shown at the link and described
further down in the same document:
https://lucene.apache.org/solr/guide/7_5/uploading-data-with-solr-cell-using-apache-tika.html#configuring-the-solr-extractingrequesthandler
You still need to check the Tika documentation to see whether Tesseract can
take its configuration from the parseContext file.
3) If you are still testing with multiple files, the Data Import Handler can
iterate through files and then, as a nested entity, feed each one to the
Tika processor for further extraction. I think one of the examples shows
that. However, I am not sure you can pass parseContext that way, and DIH is
also not recommended for production.
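
For point 2, an untested sketch of what the configuration might look like.
It reuses the handler from the question and adds a parseContext.config
entry. The file name parseContext.xml, its location in the core's conf
directory, and the use of Tika's TesseractOCRConfig "language" property with
the Tesseract code "dan" are assumptions to verify against the Tika
documentation. The Danish language data also has to be installed for
Tesseract itself.

```xml
<!-- solrconfig.xml: handler from the question plus a parseContext.config entry -->
<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="lowernames">false</str>
    <str name="uprefix">ignored_</str>
    <str name="captureAttr">true</str>
  </lst>
  <!-- Assumed file name; expected next to solrconfig.xml in the conf directory -->
  <str name="parseContext.config">parseContext.xml</str>
</requestHandler>
```

```xml
<!-- parseContext.xml (assumed name): asks Tika's Tesseract config for Danish -->
<entries>
  <entry class="org.apache.tika.parser.ocr.TesseractOCRConfig"
         impl="org.apache.tika.parser.ocr.TesseractOCRConfig">
    <!-- "dan" is Tesseract's code for Danish; assumes dan.traineddata is installed -->
    <property name="language" value="dan"/>
  </entry>
</entries>
```

For point 3, a minimal DIH sketch modelled on the Tika example that ships
with Solr. The baseDir path, the file name pattern and the target fields
"id" and "text" are placeholders, and nothing here sets the OCR language.

```xml
<!-- data-config.xml: outer entity lists files, nested entity runs Tika on each -->
<dataConfig>
  <dataSource type="BinFileDataSource"/>
  <document>
    <entity name="file" processor="FileListEntityProcessor" dataSource="null"
            baseDir="/path/to/tiff/files" fileName=".*\.tiff?"
            recursive="true" rootEntity="false">
      <!-- Use the file path as the document id (assumes an "id" field) -->
      <field column="fileAbsolutePath" name="id"/>
      <entity name="tika" processor="TikaEntityProcessor"
              url="${file.fileAbsolutePath}" format="text">
        <!-- Extracted body text goes into an assumed "text" field -->
        <field column="text" name="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```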


I hope this helps,
    Alex.

On Sun, 21 Oct 2018 at 09:24, Martin Frank Hansen (MHQ) <m...@kmd.dk> wrote:

> Hi again,
>
>
>
> Is there anyone who has some experience of using Tesseract’s OCR module
> within Solr? The files I am trying to read into Solr are Danish TIFF
> documents.
>
>
>
>
>
> *Martin Frank Hansen*, Senior Data Analytiker
>
> Data, IM & Analytics
>
>
> Lautrupparken 40-42, DK-2750 Ballerup
> E-mail m...@kmd.dk  Web www.kmd.dk
> Mobil +4525571418
>
>
>
> *From:* Martin Frank Hansen (MHQ) <m...@kmd.dk>
> *Sent:* 18 October 2018 13:30
> *To:* solr-user@lucene.apache.org
> *Subject:* Tesseract language
>
>
>
> Hi,
>
> I have been trying to use Tesseract through the data-import-handler in
> Solr and it actually works very well – with English. As the documents are
> in Danish, I need to change the language setting in Tesseract to Danish as
> well. Is that possible from Solr?
>
>
>
> I was using the update/extract-handler to import single files into Solr,
> and it worked for a single file. How would I import several files from a
> file system?
>
>
>
> Here is the request-handler I used:
>
>
>
> <requestHandler name="/update/extract"
>                 startup="lazy"
>                 class="solr.extraction.ExtractingRequestHandler">
>   <lst name="defaults">
>     <str name="lowernames">false</str>
>     <str name="uprefix">ignored_</str>
>     <str name="captureAttr">true</str>
>   </lst>
> </requestHandler>
>
>
>
