RE: Tesseract language

Martin Frank Hansen (MHQ) Fri, 26 Oct 2018 12:31:51 -0700

Hi Tim,

You were right.


When I called `tesseract testing/eurotext.png testing/eurotext-dan -l dan`, I 
got an error message, so I downloaded "dan.traineddata" and added it to the 
Tesseract-OCR/tessdata folder. I also added a 'TESSDATA_PREFIX' environment 
variable pointing to "Tesseract-OCR/tessdata".

Now Tesseract works with Danish from the CMD, but I can no longer make the 
code work in Java, not even with the default settings (which worked before). Am I 
missing something or just mixing some things up?
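
One thing that might be worth checking here is whether the JVM sees the same 'TESSDATA_PREFIX' value as the CMD window, since a Java process or IDE that was started before the variable was set will not pick it up. Below is a minimal sketch using only the Java standard library. The class name `TessdataCheck` and its helper methods are made up for illustration and are not part of Tika or Tesseract. It looks for the language file both directly under the prefix and under a `tessdata` subfolder, on the assumption that the expected layout can differ between Tesseract versions.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical diagnostic helper; not part of Tika or Tesseract.
public class TessdataCheck {

    // Places where <language>.traineddata might live, relative to the prefix.
    // Assumption: the prefix names either the tessdata folder or its parent.
    public static List<Path> candidateFiles(String prefix, String language) {
        String fileName = language + ".traineddata";
        return Arrays.asList(
                Paths.get(prefix, fileName),
                Paths.get(prefix, "tessdata", fileName));
    }

    // Summarises what this JVM can see for the given prefix and language.
    public static String describe(String prefix, String language) {
        if (prefix == null || prefix.isEmpty()) {
            return "TESSDATA_PREFIX is not visible to this JVM";
        }
        for (Path candidate : candidateFiles(prefix, language)) {
            if (Files.isRegularFile(candidate)) {
                return "found " + candidate;
            }
        }
        return "no " + language + ".traineddata under " + prefix;
    }

    public static void main(String[] args) {
        // Defaults to Danish; pass another language code as the first argument.
        String language = args.length > 0 ? args[0] : "dan";
        System.out.println(describe(System.getenv("TESSDATA_PREFIX"), language));
    }
}
```

If this reports that the variable is not visible, restarting the IDE or the console that launches Java would be the first thing to try.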



-----Original Message-----
From: Tim Allison <talli...@apache.org>
Sent: 26 October 2018 19:58
To: solr-user@lucene.apache.org
Subject: Re: Tesseract language

Tika relies on you to install tesseract and all the language libraries you'll 
need.

If you can successfully call `tesseract testing/eurotext.png 
testing/eurotext-dan -l dan`, Tika _should_ be able to specify "dan"
with your code above.
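
A quick way to test that from the Java side, sketched here with the standard library only, is to run `tesseract --list-langs` from inside the JVM and see whether "dan" shows up. The class name `TesseractLangs` and its methods are hypothetical helpers, not Tika API. The parser assumes the output is one header line followed by one language code per line, so any warning that tesseract prints would be reported as if it were a language.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic helper; not part of Tika or Tesseract.
public class TesseractLangs {

    // Parses the text printed by `tesseract --list-langs`.
    // Assumption: a "List of available languages ..." header, then one code per line.
    public static List<String> parseLanguages(String output) {
        List<String> langs = new ArrayList<>();
        for (String line : output.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("List of available languages")) {
                continue;
            }
            langs.add(trimmed);
        }
        return langs;
    }

    // Runs the tesseract executable found on this JVM's PATH and returns
    // its standard output and standard error merged into one string.
    public static String runListLangs() throws IOException, InterruptedException {
        Process process = new ProcessBuilder("tesseract", "--list-langs")
                .redirectErrorStream(true)
                .start();
        StringBuilder out = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                out.append(line).append('\n');
            }
        }
        process.waitFor();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        List<String> langs = parseLanguages(runListLangs());
        System.out.println("dan available: " + langs.contains("dan"));
    }
}
```

If the program cannot start tesseract at all, or "dan" is missing from the list, the Java process is seeing a different PATH or tessdata location than the command line.
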
On Fri, Oct 26, 2018 at 10:49 AM Martin Frank Hansen (MHQ) <m...@kmd.dk> wrote:
>
> Hi again,
>
> I have now moved the OCR part to Tika, but I still can't make it work with Danish. 
> It works when using the default language settings, so it seems like Tika is 
> missing the Danish dictionary.
>
> My java code looks like this:
>
> {
>     File file = new File(pathfilename);
>     Metadata meta = new Metadata();
>     InputStream stream = TikaInputStream.get(file);
>
>     Parser parser = new AutoDetectParser();
>     BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
>
>     TesseractOCRConfig config = new TesseractOCRConfig();
>     config.setLanguage("dan"); // code works if this line is commented out
>
>     ParseContext parseContext = new ParseContext();
>     parseContext.set(TesseractOCRConfig.class, config);
>
>     parser.parse(stream, handler, meta, parseContext);
>     System.out.println(handler.toString());
> }
>
> Hope that someone can help here.
>
> -----Original Message-----
> From: Martin Frank Hansen (MHQ) <m...@kmd.dk>
> Sent: 22 October 2018 07:58
> To: solr-user@lucene.apache.org
> Subject: RE: Tesseract language
>
> Hi Erick,
>
> Thanks for the help! I will take a look at it.
>
>
> Martin Frank Hansen, Senior Data Analytiker
>
> Data, IM & Analytics
>
>
>
> Lautrupparken 40-42, DK-2750 Ballerup
> E-mail m...@kmd.dk  Web www.kmd.dk
> Mobile +4525571418
>
> -----Original Message-----
> From: Erick Erickson <erickerick...@gmail.com>
> Sent: 21 October 2018 22:49
> To: solr-user <solr-user@lucene.apache.org>
> Subject: Re: Tesseract language
>
> Here's a skeletal program that uses Tika in a stand-alone client. Rip the 
> RDBMS parts out....
>
> https://lucidworks.com/2012/02/14/indexing-with-solrj/
> On Sun, Oct 21, 2018 at 1:13 PM Alexandre Rafalovitch <arafa...@gmail.com> 
> wrote:
> >
> > Usually, we just say to build a custom solution using the SolrJ client to
> > connect. This gives you maximum flexibility and allows you to integrate
> > Tika either inside your code or as a server. The latest Tika actually
> > has some off-thread handling, I believe, to make it safer to embed.
> >
> > For DIH alternatives, if you want configuration over custom code,
> > you could look at something like Apache NiFi. It can push data into Solr.
> > Obviously it is a bigger solution, but it is correspondingly more
> > robust too.
> >
> > Regards,
> >    Alex.
> > On Sun, 21 Oct 2018 at 11:07, Martin Frank Hansen (MHQ) <m...@kmd.dk> wrote:
> > >
> > > Hi Alexandre,
> > >
> > > Thanks for your reply.
> > >
> > > Yes, right now it is just for testing the possibilities of Solr and 
> > > Tesseract.
> > >
> > > I will take a look at the Tika documentation to see if I can make it work.
> > >
> > > You said that DIH is not recommended for production usage. What are the 
> > > recommended methods for uploading data to a Solr instance?
> > >
> > > Best regards
> > >
> > > Martin Frank Hansen
> > >
> > > -----Original Message-----
> > > From: Alexandre Rafalovitch <arafa...@gmail.com>
> > > Sent: 21 October 2018 16:26
> > > To: solr-user <solr-user@lucene.apache.org>
> > > Subject: Re: Tesseract language
> > >
> > > There are a couple of things mixed in here:
> > > 1) The extract handler is not recommended for production usage. It is great 
> > > for a quick test, just like you did, but when going to production, running 
> > > Tika externally is better. Tika, especially with large files, can use up a 
> > > lot of memory and trip up the Solr instance it is running within.
> > > 2) If you are still just testing, you can configure Tika within Solr by 
> > > specifying a parseContext.config file as shown at the link and described 
> > > further down in the same document:
> > > https://lucene.apache.org/solr/guide/7_5/uploading-data-with-solr-cell-using-apache-tika.html#configuring-the-solr-extractingrequesthandler
> > > You still need to check the Tika documentation to see whether
> > > Tesseract can take its configuration from the parseContext file.
> > > 3) If you are still testing with multiple files, the Data Import Handler can 
> > > iterate through files and then, as a nested entity, feed them to the Tika 
> > > processor for further extraction. I think one of the examples shows that.
> > > However, I am not sure you can pass parseContext that way, and DIH is also 
> > > not recommended for production.
> > >
> > > I hope this helps,
> > >     Alex.
> > >
> > > On Sun, 21 Oct 2018 at 09:24, Martin Frank Hansen (MHQ) <m...@kmd.dk> 
> > > wrote:
> > >
> > > > Hi again,
> > > >
> > > >
> > > >
> > > > Does anyone have experience of using Tesseract’s OCR
> > > > module within Solr? The files I am trying to read into Solr are
> > > > Danish TIFF documents.
> > > >
> > > > *From:* Martin Frank Hansen (MHQ) <m...@kmd.dk>
> > > > *Sent:* 18 October 2018 13:30
> > > > *To:* solr-user@lucene.apache.org
> > > > *Subject:* Tesseract language
> > > >
> > > >
> > > >
> > > > Hi,
> > > >
> > > > I have been trying to use Tesseract through the
> > > > data-import-handler in Solr and it actually works very well
> > > > with English. As the documents are in Danish, I need to change
> > > > the language setting in Tesseract to Danish as well. Is that possible 
> > > > from Solr?
> > > >
> > > > I was using the update/extract-handler to import single files
> > > > into Solr, and it worked for a single file. How would I
> > > > import several files from a file system?
> > > >
> > > >
> > > >
> > > > Here is the request-handler I used:
> > > >
> > > >
> > > >
> > > > <requestHandler name="/update/extract"
> > > >                 startup="lazy"
> > > >                 class="solr.extraction.ExtractingRequestHandler">
> > > >   <lst name="defaults">
> > > >     <str name="lowernames">false</str>
> > > >     <str name="uprefix">ignored_</str>
> > > >     <str name="captureAttr">true</str>
> > > >   </lst>
> > > > </requestHandler>
