RE: Solr OCR Support

2018-11-04 Thread Terry Steichen
+1. My experience is that you can't easily tell ahead of time whether your PDF is searchable or not. If it isn't, you may not even retrieve it because there's no text to index. Also, if you blindly OCR a file that has already been OCR'd, it can create a mess. Most higher-end PDF editors have a
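
One way to avoid both problems described above (indexing a PDF with no text layer, and re-OCRing a PDF that already has one) is a small triage step before OCR. The following is only a sketch: it assumes some other tool has already extracted the text of each page, and the function name `needs_ocr` and both thresholds are hypothetical and would need tuning on real documents.

```python
from typing import Sequence


def needs_ocr(page_texts: Sequence[str],
              min_chars: int = 25,
              min_text_fraction: float = 0.5) -> bool:
    """Heuristic: return True when too few pages carry extractable text.

    page_texts        -- text already extracted from each page (assumed input)
    min_chars         -- hypothetical threshold for "this page has real text"
    min_text_fraction -- hypothetical share of pages that must have text
    """
    if not page_texts:
        # Nothing could be extracted at all, so there is nothing to index yet.
        return True
    pages_with_text = sum(
        1 for text in page_texts if len(text.strip()) >= min_chars
    )
    return pages_with_text / len(page_texts) < min_text_fraction


if __name__ == "__main__":
    scanned = ["", "  ", ""]                 # image-only pages
    searchable = ["x" * 200, "y" * 180]      # pages with a text layer
    print(needs_ocr(scanned), needs_ocr(searchable))
```

A document for which this returns False would be skipped by the OCR step, so an existing text layer is left alone.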

RE: Solr OCR Support

2018-11-04 Thread Phil Scadden
I would strongly consider doing the OCR offline, BEFORE loading the documents into Solr. The advantage of this is that you convert your OCRed PDF into a searchable PDF. Consider someone using Solr who has found a document that matches their search criteria. Once they retrieve the document, they will

Re: Solr OCR Support

2018-11-02 Thread Tim Allison
g Nuance (or tesseract), I just wish to point out that
> what to OCR is important, because OCR works well when it has good input.
>
> > -Original Message-
> > From: Tim Allison
> > Sent: Friday, November 2, 2018 11:03 AM
> > To: solr-user@lucene.apache.org

RE: Solr OCR Support

2018-11-02 Thread Davis, Daniel (NIH/NLM) [C]
11:03 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr OCR Support
>
> OCR'ing of PDFs is fiddly at the moment because of Tika, not Solr! We
> have an open ticket to make it "just work", but we aren't there yet
> (TIKA-2749).
>
> You have to tell Tika how you want

Re: Solr OCR Support

2018-11-02 Thread Tim Allison
OCR'ing of PDFs is fiddly at the moment because of Tika, not Solr! We have an open ticket to make it "just work", but we aren't there yet (TIKA-2749). You have to tell Tika how you want to process images from PDFs via the tika-config.xml file. You've seen this link in the links you mentioned:
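
As a rough illustration of telling Tika how to process images from PDFs via tika-config.xml, here is a minimal sketch. It is not taken from this thread: the parser class names and the `extractInlineImages` parameter are assumptions based on the Tika 1.x PDFParser documentation and should be checked against the Tika version in use. The helper `pdf_parser_params` is hypothetical and only confirms that the config is well-formed and sets what you expect.

```python
import xml.etree.ElementTree as ET

PDF_PARSER = "org.apache.tika.parser.pdf.PDFParser"

# Assumption: class names and the extractInlineImages parameter follow the
# Tika 1.x PDFParser documentation; verify them for your Tika version.
TIKA_CONFIG = """<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="extractInlineImages" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
"""


def pdf_parser_params(config_xml: str) -> dict:
    """Return {param name: value} for the PDFParser entry of a config string."""
    root = ET.fromstring(config_xml.encode("utf-8"))
    for parser in root.iter("parser"):
        if parser.get("class") == PDF_PARSER:
            return {p.get("name"): p.text for p in parser.iter("param")}
    return {}


if __name__ == "__main__":
    # Sanity-check the config before saving it as tika-config.xml.
    print(pdf_parser_params(TIKA_CONFIG))
```

With inline image extraction enabled, images embedded in a PDF are handed to whatever image parser is available, which is where an OCR engine such as tesseract would come in if it is installed.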