Here's a skeletal program that uses Tika in a stand-alone client. Just
rip out the RDBMS parts.

https://lucidworks.com/2012/02/14/indexing-with-solrj/
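
For the Danish OCR case specifically, here is a rough sketch of the same "do the extraction outside Solr" idea. It is not the SolrJ program from the link above: it assumes a separately running Tika server and uses only the Python standard library. The URLs, the core name "docs", and the field names "id" and "content_txt" are made up for illustration, and you should check that your Tika server version honours the X-Tika-OCRLanguage header before relying on it.

```python
import json
import urllib.request
from pathlib import Path

# Assumed endpoints: a Tika server on its default port and a hypothetical
# Solr core named "docs". Adjust both for a real setup.
TIKA_URL = "http://localhost:9998/tika"
SOLR_UPDATE_URL = "http://localhost:8983/solr/docs/update?commit=true"


def tika_request(path, ocr_language="dan"):
    """Build a PUT request asking Tika server for the plain text of one file."""
    return urllib.request.Request(
        TIKA_URL,
        data=Path(path).read_bytes(),
        method="PUT",
        headers={
            "Accept": "text/plain",
            # Assumption: this Tika server version honours the
            # X-Tika-OCRLanguage header for Tesseract.
            "X-Tika-OCRLanguage": ocr_language,
        },
    )


def solr_request(docs):
    """Build a POST request that sends a batch of documents to Solr as JSON."""
    return urllib.request.Request(
        SOLR_UPDATE_URL,
        data=json.dumps(docs).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def index_directory(root, pattern="*.tif"):
    """OCR every matching file under root with Tika, then index the text in Solr."""
    docs = []
    for path in sorted(Path(root).rglob(pattern)):
        with urllib.request.urlopen(tika_request(path)) as resp:
            text = resp.read().decode("utf-8")
        # "id" and "content_txt" are assumed field names, not from the thread.
        docs.append({"id": str(path), "content_txt": text})
    if docs:
        with urllib.request.urlopen(solr_request(docs)) as resp:
            resp.read()
    return len(docs)
```

Calling index_directory("/path/to/tiffs") would send each TIFF to Tika and then post one batch to Solr. Because Tika runs in its own process here, a large or malformed file can only take down the Tika server, not the Solr instance. For a big file tree you would want to post in smaller batches rather than one request.
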
On Sun, Oct 21, 2018 at 1:13 PM Alexandre Rafalovitch
<arafa...@gmail.com> wrote:
>
> Usually, we just say to build a custom solution that uses the SolrJ client
> to connect. This gives you maximum flexibility and allows you to integrate
> Tika either inside your code or as a server. I believe the latest Tika
> also has some off-thread handling to make it safer to embed.
>
> For DIH alternatives, if you prefer configuration over custom code, you
> could look at something like Apache NiFi. It can push data into Solr.
> Obviously it is a bigger solution, but it is correspondingly more
> robust too.
>
> Regards,
>    Alex.
> On Sun, 21 Oct 2018 at 11:07, Martin Frank Hansen (MHQ) <m...@kmd.dk> wrote:
> >
> > Hi Alexandre,
> >
> > Thanks for your reply.
> >
> > Yes, right now it is just for testing the possibilities of Solr and
> > Tesseract.
> >
> > I will take a look at the Tika documentation to see if I can make it work.
> >
> > You said that DIH is not recommended for production use. What are the
> > recommended methods for uploading data to a Solr instance?
> >
> > Best regards
> >
> > Martin Frank Hansen
> >
> > -----Original message-----
> > From: Alexandre Rafalovitch <arafa...@gmail.com>
> > Sent: 21 October 2018 16:26
> > To: solr-user <solr-user@lucene.apache.org>
> > Subject: Re: Tesseract language
> >
> > There are a couple of things mixed in here:
> > 1) The extract handler is not recommended for production use. It is
> > great for a quick test, as you did, but for production it is better to
> > run Tika externally. Tika, especially with large files, can use up a
> > lot of memory and trip up the Solr instance it is running within.
> > 2) If you are still just testing, you can configure Tika within Solr by
> > specifying a parseContext.config file, as shown at the link and
> > described further down in the same document:
> > https://lucene.apache.org/solr/guide/7_5/uploading-data-with-solr-cell-using-apache-tika.html#configuring-the-solr-extractingrequesthandler
> > You still need to check the Tika documentation to see whether Tesseract
> > can take its configuration from the parseContext file.
> > 3) If you are still testing with multiple files, the Data Import Handler
> > can iterate through the files and then, as a nested entity, feed each
> > one to the Tika processor for further extraction. I think one of the
> > examples shows that. However, I am not sure you can pass a parseContext
> > that way, and DIH is also not recommended for production.
> >
> > I hope this helps,
> >     Alex.
> >
> > On Sun, 21 Oct 2018 at 09:24, Martin Frank Hansen (MHQ) <m...@kmd.dk> wrote:
> >
> > > Hi again,
> > >
> > > Does anyone have experience using Tesseract’s OCR module within Solr?
> > > The files I am trying to read into Solr are Danish TIFF documents.
> > >
> > > *Martin Frank Hansen*, Senior Data Analytiker
> > >
> > > Data, IM & Analytics
> > >
> > > Lautrupparken 40-42, DK-2750 Ballerup
> > > E-mail m...@kmd.dk  Web www.kmd.dk
> > > Mobil +4525571418
> > >
> > > *From:* Martin Frank Hansen (MHQ) <m...@kmd.dk>
> > > *Sent:* 18 October 2018 13:30
> > > *To:* solr-user@lucene.apache.org
> > > *Subject:* Tesseract language
> > >
> > > Hi,
> > >
> > > I have been trying to use Tesseract through the Data Import Handler in
> > > Solr, and it works very well with English. As the documents are in
> > > Danish, I need to change the language setting in Tesseract to Danish
> > > as well. Is that possible from Solr?
> > >
> > > I was using the /update/extract handler to import single files into
> > > Solr, and it worked for a single file. How would I import several
> > > files from a file system?
> > >
> > > Here is the request handler I used:
> > >
> > > <requestHandler name="/update/extract"
> > >                 startup="lazy"
> > >                 class="solr.extraction.ExtractingRequestHandler" >
> > >   <lst name="defaults">
> > >     <str name="lowernames">false</str>
> > >     <str name="uprefix">ignored_</str>
> > >     <str name="captureAttr">true</str>
> > >   </lst>
> > > </requestHandler>