+1
My experience is that you can't easily tell ahead of time whether a PDF is
searchable or not. If it isn't, a search may never even retrieve it, because
there's no text to index. Also, if you blindly OCR a file that has already been
OCR'd, it can create a mess. Most higher-end PDF editors have a batch mode for
OCR processing, if that works better for you.
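
One rough way to guard against both problems is to extract the text with OCR
turned off first, then only send a file to OCR if most of its pages come back
empty. The sketch below is only an illustration of that idea: the class name
OcrTriage, the method names, and the 20-character threshold are my own
assumptions, not part of Tika or Solr, and it expects you to supply the
per-page text from whatever extractor you already use.

```java
import java.util.List;

// Hypothetical helper (not a Tika or Solr API): decides whether a PDF still
// looks like it needs OCR, given the text extracted per page with OCR off.
final class OcrTriage {

    // Assumed threshold: a page with fewer non-whitespace characters than
    // this is treated as having no usable text layer. Tune for your corpus.
    static final int MIN_CHARS_PER_PAGE = 20;

    // Counts the non-whitespace characters in one page's extracted text.
    static int visibleChars(String pageText) {
        if (pageText == null) {
            return 0;
        }
        int count = 0;
        for (int i = 0; i < pageText.length(); i++) {
            if (!Character.isWhitespace(pageText.charAt(i))) {
                count++;
            }
        }
        return count;
    }

    // Returns true when at least half of the pages have (almost) no text,
    // or when no pages were extracted at all.
    static boolean needsOcr(List<String> pageTexts) {
        if (pageTexts.isEmpty()) {
            return true;
        }
        long emptyPages = pageTexts.stream()
                .filter(t -> visibleChars(t) < MIN_CHARS_PER_PAGE)
                .count();
        return emptyPages * 2 >= pageTexts.size();
    }

    public static void main(String[] args) {
        List<String> scanned = List.of("", "  \n", "");
        List<String> searchable = List.of(
                "Solr indexes the text layer of this page.",
                "A second page with real, extractable text.");
        System.out.println("scanned needs OCR: " + needsOcr(scanned));
        System.out.println("searchable needs OCR: " + needsOcr(searchable));
    }
}
```

Files flagged this way would go to the offline OCR step; the rest can be
indexed as they are, which avoids running OCR over a file that already has a
text layer.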

On November 4, 2018 5:20:41 PM EST, Phil Scadden <p.scad...@gns.cri.nz> wrote:
>I would strongly consider doing OCR offline, BEFORE loading the documents
>into Solr. The advantage of this is that you convert your scanned PDF
>into a searchable PDF. Consider someone using Solr who has found a
>document that matches their search criteria. Once they retrieve the
>document, they will discover it has not been OCRed and they cannot
>use a text search within the document. If the document that you are
>feeding Solr is large, then this is a major pain. Setting up Tesseract
>(or whatever engine; Tesseract involves a bit of a tool chain) to OCR
>and save as searchable PDF means you can provide a much more useful
>document as the result of a Solr search. Feed that searchable PDF to
>SolrJ with OCR turned off.
>
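>    // context is the Tika ParseContext; parser is the Parser in use
>    PDFParserConfig pdfConfig = new PDFParserConfig();
>    pdfConfig.setExtractInlineImages(false);
>    // the PDF is already searchable, so skip OCR at index time
>    pdfConfig.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.NO_OCR);
>    context.set(PDFParserConfig.class, pdfConfig);
>    context.set(Parser.class, parser);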
>
>-----Original Message-----
>From: Furkan KAMACI <furkankam...@gmail.com>
>Sent: Saturday, 3 November 2018 03:30
>To: solr-user@lucene.apache.org
>Subject: Solr OCR Support
>
>Hi All,
>
>I want to index images, and PDF documents that contain images, into
>Solr. I am testing this with Solr 6.3.0.
>
>I've installed Tesseract on my computer (Mac). I have verified that
>Tesseract works fine for extracting text from an image.
>
>I indexed an image into Solr, but it has no content. However, as far as
>I know, I don't need to do anything else to integrate Tesseract with
>Solr.
>
>I've checked these but they were not useful for me:
>
>http://lucene.472066.n3.nabble.com/TIKA-OCR-not-working-td4201834.html
>http://lucene.472066.n3.nabble.com/Fwd-configuring-Solr-with-Tesseract-td4361908.html
>
>My question is, how can I support OCR with Solr?
>
>Kind Regards,
>Furkan KAMACI

