I would strongly consider running OCR offline, BEFORE loading the documents
into Solr. The advantage is that you convert each scanned PDF into a
searchable PDF. Consider someone who uses Solr and finds a document that
matches their search criteria. Once they retrieve the document, they discover
it has not been OCRed and they cannot run a text search within it. If the
documents you are feeding Solr are large, this is a major pain. Setting up
Tesseract (or whatever engine you prefer; Tesseract involves a bit of a tool
chain) to OCR and save as searchable PDF means you can provide a much more
useful document as the result of a Solr search. Feed that searchable PDF to
SolrJ with OCR turned off:

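    // Assumes parser (e.g. an AutoDetectParser) and context (a ParseContext)
    // were created earlier.
    PDFParserConfig pdfConfig = new PDFParserConfig();
    // Skip embedded images and disable OCR; the PDF already has a text layer.
    pdfConfig.setExtractInlineImages(false);
    pdfConfig.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.NO_OCR);
    context.set(PDFParserConfig.class, pdfConfig);
    context.set(Parser.class, parser);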
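
For the offline OCR step itself, here is a minimal sketch of driving Tesseract
from Java. It assumes the tesseract binary is on the PATH and that the input is
already an image (TIFF, PNG, etc.); Tesseract does not read PDFs directly, so a
scanned PDF has to be rasterized by another tool first, which is part of the
tool chain mentioned above. The class name, method names and file names are
hypothetical and only for illustration.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

final class OfflineOcr {

    // Builds the command line "tesseract <image> <outputBase> pdf".
    // Tesseract appends ".pdf" to outputBase on its own.
    static List<String> buildCommand(Path image, Path outputBase) {
        return Arrays.asList("tesseract", image.toString(), outputBase.toString(), "pdf");
    }

    // Runs Tesseract and returns its exit code (0 means success).
    static int ocrToSearchablePdf(Path image, Path outputBase)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(image, outputBase))
                .inheritIO()   // show Tesseract's own messages on the console
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: OfflineOcr <image> <outputBase>");
            return;
        }
        int exit = ocrToSearchablePdf(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println("tesseract exit code: " + exit);
    }
}
```

Running it as "java OfflineOcr scan.tif scan-ocr" (hypothetical file names)
should leave a searchable scan-ocr.pdf next to the input, which can then be fed
to SolrJ with OCR turned off as above.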

-----Original Message-----
From: Furkan KAMACI <furkankam...@gmail.com>
Sent: Saturday, 3 November 2018 03:30
To: solr-user@lucene.apache.org
Subject: Solr OCR Support

Hi All,

I want to index images, and PDF documents that contain images, into Solr. I
tested this with Solr 6.3.0.

I've installed Tesseract on my computer (Mac) and verified that it works fine
for extracting text from an image.

I indexed an image into Solr, but the resulting document has no content.
However, as far as I know, I don't need to do anything else to integrate
Tesseract with Solr.

I've checked these but they were not useful for me:

http://lucene.472066.n3.nabble.com/TIKA-OCR-not-working-td4201834.html
http://lucene.472066.n3.nabble.com/Fwd-configuring-Solr-with-Tesseract-td4361908.html

My question is, how can I support OCR with Solr?

Kind Regards,
Furkan KAMACI
