Here's a skeletal program that uses Tika in a stand-alone client. Just rip out the RDBMS parts:
https://lucidworks.com/2012/02/14/indexing-with-solrj/

On Sun, Oct 21, 2018 at 1:13 PM Alexandre Rafalovitch <arafa...@gmail.com> wrote:
>
> Usually, we just say to do a custom solution using the SolrJ client to
> connect. This gives you maximum flexibility and lets you integrate
> Tika either inside your code or as a server. The latest Tika actually has
> some off-thread handling, I believe, to make it safer to embed.
>
> For DIH alternatives, if you want configuration over custom code, you
> could look at something like Apache NiFi. It can push data into Solr.
> Obviously it is a bigger solution, but it is correspondingly more
> robust too.
>
> Regards,
>    Alex.
>
> On Sun, 21 Oct 2018 at 11:07, Martin Frank Hansen (MHQ) <m...@kmd.dk> wrote:
> >
> > Hi Alexandre,
> >
> > Thanks for your reply.
> >
> > Yes, right now it is just for testing the possibilities of Solr and
> > Tesseract.
> >
> > I will take a look at the Tika documentation to see if I can make it work.
> >
> > You said that DIH is not recommended for production usage; what is the
> > recommended method to upload data to a Solr instance?
> >
> > Best regards
> >
> > Martin Frank Hansen
> >
> > -----Original message-----
> > From: Alexandre Rafalovitch <arafa...@gmail.com>
> > Sent: 21 October 2018 16:26
> > To: solr-user <solr-user@lucene.apache.org>
> > Subject: Re: Tesseract language
> >
> > There are a couple of things mixed in here:
> > 1) The extract handler is not recommended for production usage. It is
> > great for a quick test, just like you did it, but going to production,
> > running it externally is better. Tika, especially with large files, can
> > use up a lot of memory and trip up the Solr instance it is running within.
> > 2) If you are still just testing, you can configure Tika within Solr by
> > specifying a parseContext.config file as shown at the link and described
> > further down in the same document:
> > https://lucene.apache.org/solr/guide/7_5/uploading-data-with-solr-cell-using-apache-tika.html#configuring-the-solr-extractingrequesthandler
> > You still need to check the Tika documentation to see whether Tesseract
> > can take its configuration from the parseContext file.
> > 3) If you are still testing with multiple files, the Data Import Handler
> > can iterate through files and then, as a nested entity, feed them to the
> > Tika processor for further extraction. I think one of the examples shows
> > that. However, I am not sure you can pass parseContext that way, and DIH
> > is also not recommended for production.
> >
> > I hope this helps,
> >     Alex.
> >
> > On Sun, 21 Oct 2018 at 09:24, Martin Frank Hansen (MHQ) <m...@kmd.dk> wrote:
> >
> > > Hi again,
> > >
> > > Is there anyone who has some experience of using Tesseract’s OCR
> > > module within Solr? The files I am trying to read into Solr are Danish
> > > TIFF documents.
> > >
> > > *Martin Frank Hansen*, Senior Data Analyst
> > >
> > > Data, IM & Analytics
> > >
> > > Lautrupparken 40-42, DK-2750 Ballerup
> > > E-mail m...@kmd.dk Web www.kmd.dk
> > > Mobile +4525571418
> > >
> > > *From:* Martin Frank Hansen (MHQ) <m...@kmd.dk>
> > > *Sent:* 18 October 2018 13:30
> > > *To:* solr-user@lucene.apache.org
> > > *Subject:* Tesseract language
> > >
> > > Hi,
> > >
> > > I have been trying to use Tesseract through the data-import-handler in
> > > Solr and it actually works very well - with English. As the documents
> > > are in Danish, I need to change the language setting in Tesseract to
> > > Danish as well. Is that possible from Solr?
> > >
> > > I was using the update/extract handler to import single files into
> > > Solr, and it worked for a single file. How would I implement this for
> > > several files from a file system?
> > >
> > > Here is the request handler I used:
> > >
> > > <requestHandler name="/update/extract"
> > >                 startup="lazy"
> > >                 class="solr.extraction.ExtractingRequestHandler">
> > >   <lst name="defaults">
> > >     <str name="lowernames">false</str>
> > >     <str name="uprefix">ignored_</str>
> > >     <str name="captureAttr">true</str>
> > >   </lst>
> > > </requestHandler>
> > >
> > > Protection of your personal data is important to us. Here you can read
> > > KMD’s Privacy Policy <http://www.kmd.net/Privacy-Policy> outlining how
> > > we process your personal data.
> > >
> > > Please note that this message may contain confidential information. If
> > > you have received this message by mistake, please inform the sender of
> > > the mistake by sending a reply, then delete the message from your
> > > system without making, distributing or retaining any copies of it.
> > > Although we believe that the message and any attachments are free from
> > > viruses and other errors that might affect the computer or IT system
> > > where it is received and read, the recipient opens the message at his or
> > > her own risk.
> > > We assume no responsibility for any loss or damage arising from the
> > > receipt or use of this message.
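
On the question of handling several files from a file system: below is a minimal sketch, in plain Java, of just the file-walking half of a stand-alone client. It is not taken from the linked article. The class and method names (TiffBatchSketch, findTiffs, processAll) are made up for illustration, and the handler is only a placeholder for the spot where the Tika parsing (with Tesseract set to Danish) and the SolrJ call would go. Nothing in it talks to Solr, Tika or Tesseract.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Locale;
import java.util.function.Consumer;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper: finds TIFF files under a folder for a stand-alone indexing client. */
final class TiffBatchSketch {

    /** True when the file name ends in .tif or .tiff, ignoring case. */
    static boolean isTiff(Path file) {
        String name = file.getFileName().toString().toLowerCase(Locale.ROOT);
        return name.endsWith(".tif") || name.endsWith(".tiff");
    }

    /** Recursively collects regular TIFF files under root, sorted for a stable order. */
    static List<Path> findTiffs(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                        .filter(TiffBatchSketch::isTiff)
                        .sorted()
                        .collect(Collectors.toList());
        }
    }

    /** Hands each TIFF file to the supplied handler and returns how many were handled. */
    static int processAll(Path root, Consumer<Path> handler) throws IOException {
        List<Path> files = findTiffs(root);
        files.forEach(handler);
        return files.size();
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        // Placeholder handler (assumption): a real client would extract text with
        // Tika here and send the resulting document to Solr through SolrJ.
        int count = processAll(root, file -> System.out.println("would index: " + file));
        System.out.println(count + " TIFF file(s) found");
    }
}
```

Pass the folder to scan as the first argument. Keeping the directory walk separate from the handler means the extraction and indexing step can be swapped in later without touching the file discovery.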