RE: Tesseract language

Martin Frank Hansen (MHQ) Sun, 28 Oct 2018 08:55:27 -0700

Hi Tim and Rohan,

Really appreciate your help, and I finally made it work (without tess4j).


The problem was an environment variable with the wrong setting. Instead of
setting TESSDATA_PREFIX to 'Tesseract-OCR/tessdata', I changed it to the
parent folder 'Tesseract-OCR', and now it works for Danish.
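
A quick way to verify this kind of setup is to check where the language file
would be looked up. The sketch below is only an illustration: it assumes the
layout reported above (TESSDATA_PREFIX is the parent folder and the language
files sit in its 'tessdata' subfolder). Other Tesseract versions may expect
TESSDATA_PREFIX to point at the tessdata folder itself, so treat the layout as
an assumption. The class and method names are hypothetical.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class TessdataCheck {

    /** Expected location of a language file when the prefix is the parent of tessdata. */
    public static Path trainedDataPath(String tessdataPrefix, String language) {
        return Paths.get(tessdataPrefix, "tessdata", language + ".traineddata");
    }

    /** True if <prefix>/tessdata/<language>.traineddata exists as a regular file. */
    public static boolean hasLanguage(String tessdataPrefix, String language) {
        return tessdataPrefix != null
                && Files.isRegularFile(trainedDataPath(tessdataPrefix, language));
    }

    public static void main(String[] args) {
        // Read the variable exactly as the Java process sees it.
        String prefix = System.getenv("TESSDATA_PREFIX");
        String language = args.length > 0 ? args[0] : "dan";
        if (prefix == null) {
            System.out.println("TESSDATA_PREFIX is not set");
        } else {
            System.out.println(trainedDataPath(prefix, language)
                    + (hasLanguage(prefix, language) ? " found" : " missing"));
        }
    }
}
```

Running it from the same environment as the Java application shows whether the
variable is visible to the JVM and whether it points one folder too deep.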

Thanks again for helping.

Best regards

Martin

-----Original Message-----
From: Tim Allison <talli...@apache.org>
Sent: 27 October 2018 14:37
To: solr-user@lucene.apache.org; u...@tika.apache.org
Subject: Re: Tesseract language

Martin,
  Let’s move this over to user@tika.

Rohan,
  Is there something about Tika’s use of tesseract for image files that can be 
improved?

    Best,
       Tim

On Sat, Oct 27, 2018 at 3:40 AM Rohan Kasat <rohan.ka...@gmail.com> wrote:

> I used tess4j for image formats and Tika for scanned PDFs and images
> within PDFs.
>
> Regards,
> Rohan Kasat
>
> On Sat, Oct 27, 2018 at 12:39 AM Martin Frank Hansen (MHQ)
> <m...@kmd.dk>
> wrote:
>
> > Hi Rohan,
> >
> > Thanks for your reply. Are you using tess4j with Tika or on its own?
> > I will take a look at tess4j if I can't make it work with Tika alone.
> >
> > Best regards
> > Martin
> >
> >
> > -----Original Message-----
> > From: Rohan Kasat <rohan.ka...@gmail.com>
> > Sent: 26 October 2018 21:45
> > To: solr-user@lucene.apache.org
> > Subject: Re: Tesseract language
> >
> > Hi Martin,
> >
> > Are you using it for image formats? I think you can try tess4j and
> > set TESSDATA_PREFIX to the home directory of the Tesseract configs.
> >
> > I have tried it and it works pretty well on my local machine.
> >
> > I used Java 8 and Tesseract 3 for this.
> >
> > Regards,
> > Rohan Kasat
> >
> > On Fri, Oct 26, 2018 at 12:31 PM Martin Frank Hansen (MHQ)
> > <m...@kmd.dk>
> > wrote:
> >
> > > Hi Tim,
> > >
> > > You were right.
> > >
> > > When I called `tesseract testing/eurotext.png testing/eurotext-dan
> > > -l dan`, I got an error message so I downloaded "dan.traineddata"
> > > and added it to the Tesseract-OCR/tessdata folder. Furthermore I
> > > added the 'TESSDATA_PREFIX' variable to the path-variables
> > > pointing to "Tesseract-OCR/tessdata".
> > >
> > > Now Tesseract works with Danish from the CMD, but I can't make the
> > > code work in Java, not even with the default settings (which I could
> > > before). Am I missing something or just mixing things up?
> > >
> > >
> > >
> > > -----Original Message-----
> > > From: Tim Allison <talli...@apache.org>
> > > Sent: 26 October 2018 19:58
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: Tesseract language
> > >
> > > Tika relies on you to install tesseract and all the language
> > > libraries you'll need.
> > >
> > > If you can successfully call `tesseract testing/eurotext.png
> > > testing/eurotext-dan -l dan`, Tika _should_ be able to specify "dan"
> > > with your code above.
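
The command-line check described above can also be run from Java, which
confirms that the JVM sees the same Tesseract installation and environment as
the shell. This is a minimal sketch, not how Tika invokes Tesseract
internally; the class and method names are hypothetical, and it assumes the
`tesseract` executable is on the PATH of the Java process.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class TesseractCliCheck {

    /** Builds the command line: tesseract <image> <outputBase> -l <language>. */
    public static List<String> command(String image, String outputBase, String language) {
        return Arrays.asList("tesseract", image, outputBase, "-l", language);
    }

    /** Runs the command and returns its exit code. */
    public static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .inheritIO() // pass Tesseract's own messages through to the console
                .start();    // throws IOException if the executable cannot be started
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        List<String> cmd = command("testing/eurotext.png", "testing/eurotext-dan", "dan");
        System.out.println(String.join(" ", cmd));
        System.out.println("exit code: " + run(cmd));
    }
}
```

If the command succeeds in a terminal but fails here, the difference is likely
in the environment (PATH or TESSDATA_PREFIX) that the Java process inherits.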
> > > On Fri, Oct 26, 2018 at 10:49 AM Martin Frank Hansen (MHQ)
> > > <m...@kmd.dk>
> > > wrote:
> > > >
> > > > Hi again,
> > > >
> > > > > I have now moved the OCR part to Tika, but I still can't make it
> > > > > work with Danish. It works with the default language settings, so
> > > > > it seems like Tika is missing the Danish dictionary.
> > > >
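> > > > > My Java code looks like this:
> > > > >
> > > > > import java.io.File;
> > > > > import java.io.InputStream;
> > > > > import org.apache.tika.io.TikaInputStream;
> > > > > import org.apache.tika.metadata.Metadata;
> > > > > import org.apache.tika.parser.AutoDetectParser;
> > > > > import org.apache.tika.parser.ParseContext;
> > > > > import org.apache.tika.parser.Parser;
> > > > > import org.apache.tika.parser.ocr.TesseractOCRConfig;
> > > > > import org.apache.tika.sax.BodyContentHandler;
> > > > >
> > > > > {
> > > > >     File file = new File(pathfilename);
> > > > >     Metadata meta = new Metadata();
> > > > >     Parser parser = new AutoDetectParser();
> > > > >     BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
> > > > >
> > > > >     // Ask Tesseract for Danish; the code works if this line is commented out.
> > > > >     TesseractOCRConfig config = new TesseractOCRConfig();
> > > > >     config.setLanguage("dan");
> > > > >
> > > > >     // Hand the OCR settings to the parser through the parse context.
> > > > >     ParseContext parseContext = new ParseContext();
> > > > >     parseContext.set(TesseractOCRConfig.class, config);
> > > > >
> > > > >     // try-with-resources closes the stream once parsing is done.
> > > > >     try (InputStream stream = TikaInputStream.get(file)) {
> > > > >         parser.parse(stream, handler, meta, parseContext);
> > > > >     }
> > > > >     System.out.println(handler.toString());
> > > > > }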
> > > >
> > > > Hope that someone can help here.
> > > >
> > > > -----Original Message-----
> > > > From: Martin Frank Hansen (MHQ) <m...@kmd.dk>
> > > > > Sent: 22 October 2018 07:58
> > > > > To: solr-user@lucene.apache.org
> > > > > Subject: SV: Tesseract language
> > > >
> > > > Hi Erick,
> > > >
> > > > Thanks for the help! I will take a look at it.
> > > >
> > > >
> > > > > Martin Frank Hansen, Senior Data Analyst
> > > >
> > > > Data, IM & Analytics
> > > >
> > > >
> > > >
> > > > Lautrupparken 40-42, DK-2750 Ballerup E-mail m...@kmd.dk  Web
> > > > > www.kmd.dk Mobile +4525571418
> > > >
> > > > > -----Original Message-----
> > > > > From: Erick Erickson <erickerick...@gmail.com>
> > > > > Sent: 21 October 2018 22:49
> > > > > To: solr-user <solr-user@lucene.apache.org>
> > > > > Subject: Re: Tesseract language
> > > >
> > > > > Here's a skeletal program that uses Tika in a stand-alone client.
> > > > > Rip the RDBMS parts out....
> > > >
> > > > https://lucidworks.com/2012/02/14/indexing-with-solrj/
> > > > > On Sun, Oct 21, 2018 at 1:13 PM Alexandre Rafalovitch
> > > > > <arafa...@gmail.com> wrote:
> > > > >
> > > > > > Usually, we just say to do a custom solution using a SolrJ
> > > > > > client to connect. This gives you maximum flexibility and
> > > > > > allows you to integrate Tika either inside your code or as a
> > > > > > server. The latest Tika actually has some off-thread handling,
> > > > > > I believe, to make it safer to embed.
> > > > >
> > > > > > For DIH alternatives, if you want configuration over custom
> > > > > > code, you could look at something like Apache NiFi. It can
> > > > > > push data into Solr.
> > > > > > Obviously it is a bigger solution, but it is correspondingly
> > > > > > more robust too.
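
For the custom-client approach suggested above, the first step is usually to
walk the file system and pick out the documents to process. The sketch below
covers only that step with the standard library; the Tika parsing and the
SolrJ upload are left as a placeholder comment, and the class and method
names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class TiffFileWalker {

    /** True for file names ending in .tif or .tiff, ignoring case. */
    public static boolean isTiff(Path path) {
        String name = path.getFileName().toString().toLowerCase(Locale.ROOT);
        return name.endsWith(".tif") || name.endsWith(".tiff");
    }

    /** Recursively collects all TIFF files below root, sorted by path. */
    public static List<Path> findTiffs(Path root) throws IOException {
        // Files.walk holds open directory handles, so close the stream when done.
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                        .filter(TiffFileWalker::isTiff)
                        .sorted()
                        .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path tiff : findTiffs(root)) {
            // Placeholder: parse the file with Tika and send the result to Solr here.
            System.out.println(tiff);
        }
    }
}
```

Keeping the file discovery separate from parsing and indexing makes it easy to
test each part on its own and to retry individual files that fail.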
> > > > >
> > > > > Regards,
> > > > >    Alex.
> > > > > > On Sun, 21 Oct 2018 at 11:07, Martin Frank Hansen (MHQ)
> > > > > > <m...@kmd.dk> wrote:
> > > > > >
> > > > > > Hi Alexandre,
> > > > > >
> > > > > > Thanks for your reply.
> > > > > >
> > > > > > > Yes, right now it is just for testing the possibilities of
> > > > > > > Solr and Tesseract.
> > > > > > >
> > > > > > > I will take a look at the Tika documentation to see if I can
> > > > > > > make it work.
> > > > > > >
> > > > > > > You said that DIH is not recommended for production usage.
> > > > > > > What are the recommended methods for uploading data to a Solr
> > > > > > > instance?
> > > > > >
> > > > > > Best regards
> > > > > >
> > > > > > Martin Frank Hansen
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Alexandre Rafalovitch <arafa...@gmail.com>
> > > > > > > Sent: 21 October 2018 16:26
> > > > > > > To: solr-user <solr-user@lucene.apache.org>
> > > > > > > Subject: Re: Tesseract language
> > > > > >
> > > > > > > There are a couple of things mixed in here:
> > > > > > > 1) The extract handler is not recommended for production usage.
> > > > > > > It is great for a quick test, just like you did, but going to
> > > > > > > production, running it externally is better. Tika, especially
> > > > > > > with large files, can use up a lot of memory and trip up the
> > > > > > > Solr instance it is running within.
> > > > > > > 2) If you are still just testing, you can configure Tika
> > > > > > > within Solr by specifying a parseContext.config file as shown
> > > > > > > at the link and described further down in the same document:
> > > > > > > https://lucene.apache.org/solr/guide/7_5/uploading-data-with-solr-cell-using-apache-tika.html#configuring-the-solr-extractingrequesthandler
> > > > > > > You still need to check the Tika documentation to see whether
> > > > > > > Tesseract can take its configuration from the parseContext
> > > > > > > file.
> > > > > > > 3) If you are still testing with multiple files, the Data
> > > > > > > Import Handler can iterate through files and then, as a nested
> > > > > > > entity, feed them to the Tika processor for further
> > > > > > > extraction. I think one of the examples shows that.
> > > > > > > However, I am not sure you can pass parseContext that way,
> > > > > > > and DIH is also not recommended for production.
> > > > > >
> > > > > > I hope this helps,
> > > > > >     Alex.
> > > > > >
> > > > > > > On Sun, 21 Oct 2018 at 09:24, Martin Frank Hansen (MHQ)
> > > > > > > <m...@kmd.dk> wrote:
> > > > > >
> > > > > > > Hi again,
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > Does anyone have experience of using Tesseract’s OCR
> > > > > > > > module within Solr? The files I am trying to read into
> > > > > > > > Solr are Danish TIFF documents.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > *Martin Frank Hansen*, Senior Data Analytiker
> > > > > > >
> > > > > > > Data, IM & Analytics
> > > > > > >
> > > > > > > [image: cid:image001.png@01D383C9.6C129A60]
> > > > > > >
> > > > > > >
> > > > > > > Lautrupparken 40-42, DK-2750 Ballerup E-mail m...@kmd.dk
> > > > > > > Web www.kmd.dk Mobil +4525571418
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > *From:* Martin Frank Hansen (MHQ) <m...@kmd.dk>
> > > > > > > > *Sent:* 18 October 2018 13:30
> > > > > > > > *To:* solr-user@lucene.apache.org
> > > > > > > > *Subject:* Tesseract language
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > > I have been trying to use Tesseract through the
> > > > > > > > data-import-handler in Solr and it actually works very
> > > > > > > > well, with English. As the documents are in Danish, I
> > > > > > > > need to change the language setting in Tesseract to
> > > > > > > > Danish as well. Is that possible from Solr?
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > I was using the update/extract handler to import single
> > > > > > > > files into Solr, and it worked for a single file. How
> > > > > > > > would I import several files from a file system?
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Here is the request-handler I used:
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > <requestHandler name="/update/extract"
> > > > > > > >                 startup="lazy"
> > > > > > > >                 class="solr.extraction.ExtractingRequestHandler">
> > > > > > > >     <lst name="defaults">
> > > > > > > >       <str name="lowernames">false</str>
> > > > > > > >       <str name="uprefix">ignored_</str>
> > > > > > > >       <str name="captureAttr">true</str>
> > > > > > > >     </lst>
> > > > > > > > </requestHandler>
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > *Martin Frank Hansen*, Senior Data Analytiker
> > > > > > >
> > > > > > > Data, IM & Analytics
> > > > > > >
> > > > > > > [image: cid:image001.png@01D383C9.6C129A60]
> > > > > > >
> > > > > > >
> > > > > > > Lautrupparken 40-42, DK-2750 Ballerup E-mail m...@kmd.dk
> > > > > > > Web www.kmd.dk Mobil +4525571418
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Protection of your personal data is important to us. Here
> > > > > > > you can read KMD’s Privacy Policy
> > > > > > > <http://www.kmd.net/Privacy-Policy>
> > > > > > > outlining how we process your personal data.
> > > > > > >
> > > > > > > > Please note that this message may contain confidential
> > > > > > > > information. If you have received this message by mistake,
> > > > > > > > please inform the sender of the mistake by sending a
> > > > > > > > reply, then delete the message from your system without
> > > > > > > > making, distributing or retaining any copies of it.
> > > > > > > > Although we believe that the message and any attachments
> > > > > > > > are free from viruses and other errors that might affect
> > > > > > > > the computer or IT system where it is received and read,
> > > > > > > > the recipient opens the message at his or her own risk.
> > > > > > > > We assume no responsibility for any loss or damage arising
> > > > > > > > from the receipt or use of this message.
> > > > > > >
> > >
> > --
> >
> > *Regards,Rohan Kasat*
> >
> --
>
> *Regards,Rohan Kasat*
>
