Re: Solr OCR Support

Tim Allison Fri, 02 Nov 2018 09:55:15 -0700
+1 Thank you, Daniel.  If you have any interest in helping out on
TIKA-2749, please join the fun. :D
On Fri, Nov 2, 2018 at 12:12 PM Davis, Daniel (NIH/NLM) [C]
<daniel.da...@nih.gov> wrote:
>
> I think that you also have to process a PDF pretty deeply to decide whether 
> you want it to be OCR'd.   I have worked on projects where all of the PDFs are 
> really like faxes - images are encoded in JBIG2 black and white or similar, 
> and there is really one image per page, and no text.   I have also worked on 
> projects where the PDFs really are unstructured data, but even there, if a 
> PDF has one image per page and no text, it should be OCR'd.
>
> I've had problems, not with Tesseract, but even with Nuance OCR OEM 
> libraries, where text was missed because one image held the top half of the 
> letters, and the image on the next line held the bottom half of the letters.   
> I don't mean to ding Nuance (or Tesseract), I just wish to point out that 
> choosing what to OCR is important, because OCR works well when it has good 
> input.
>
> > -----Original Message-----
> > From: Tim Allison <talli...@apache.org>
> > Sent: Friday, November 2, 2018 11:03 AM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Solr OCR Support
> >
> > OCR'ing of PDFs is fiddly at the moment because of Tika, not Solr!  We
> > have an open ticket to make it "just work", but we aren't there yet
> > (TIKA-2749).
> >
> > You have to tell Tika how you want to process images from PDFs via the
> > tika-config.xml file.
> >
> > You've seen this link in the links you mentioned:
> > https://wiki.apache.org/tika/TikaOCR
> >
> > This one is key for PDFs:
> > https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29#OCR
> > On Fri, Nov 2, 2018 at 10:30 AM Furkan KAMACI <furkankam...@gmail.com>
> > wrote:
> > >
> > > Hi All,
> > >
> > > I want to index images, and PDF documents that contain images, into
> > > Solr. I am testing with Solr 6.3.0.
> > >
> > > I've installed Tesseract on my computer (Mac), and I have verified that
> > > Tesseract extracts text from an image correctly.
> > >
> > > I indexed an image into Solr, but it has no content. However, as far as
> > > I know, I don't need to do anything else to integrate Tesseract with Solr.
> > >
> > > I've checked these but they were not useful for me:
> > >
> > > http://lucene.472066.n3.nabble.com/TIKA-OCR-not-working-td4201834.html
> > > http://lucene.472066.n3.nabble.com/Fwd-configuring-Solr-with-Tesseract-td4361908.html
> > >
> > > My question is, how can I support OCR with Solr?
> > >
> > > Kind Regards,
> > > Furkan KAMACI
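
Daniel's rule of thumb above (OCR a PDF when it has one image per page and no text, and be wary of pages whose text is split across several image strips) could be sketched roughly as follows. This is only an illustration under stated assumptions: PageSummary, ocr_plan and document_needs_ocr are hypothetical names, not part of Tika, Solr or Tesseract, and the per-page image and character counts are assumed to come from whatever PDF library you already use.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageSummary:
    """Hypothetical per-page summary (not a Tika or Solr type)."""
    image_count: int  # number of images found on the page
    text_chars: int   # number of extractable text characters on the page


def ocr_plan(page: PageSummary) -> str:
    """Suggest how to treat one page, following the heuristic in the thread."""
    if page.text_chars > 0 or page.image_count == 0:
        # The page already has text, or there is nothing to OCR.
        return "no_ocr"
    if page.image_count == 1:
        # Fax-like page: one image, no text.
        return "ocr_image"
    # Several images and no text: letters may be split across image
    # strips, so OCR the rendered page rather than each image alone.
    return "render_page"


def document_needs_ocr(pages: List[PageSummary], min_fraction: float = 1.0) -> bool:
    """True when at least min_fraction of the pages are fax-like."""
    if not pages:
        return False
    fax_like = sum(1 for p in pages if ocr_plan(p) == "ocr_image")
    return fax_like / len(pages) >= min_fraction
```

With the default min_fraction of 1.0, only documents in which every page is a single image with no text are flagged; lowering it (an assumption, not something the thread specifies) lets mixed documents through as well.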