Re: Indexing scanned PDFs

2014-05-06 Thread Jack Krupansky
Also, be aware that there are a lot of PDF files whose text is the
result of a low-accuracy OCR scan of the page images in the PDF file.
High-accuracy OCR is rather expensive. You can usually tell if you have
a scanned PDF by zooming way in: a PDF file generated directly from a
word processor source file will retain smooth curves on characters, while a
PDF generated from scanned page images will show heavy pixelation.
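
For bulk collections where zooming in by hand is not practical, a rough
programmatic check is possible. The following is only a minimal sketch, and
it assumes the text has already been extracted from each PDF by some other
tool (Tika, for example). The function names, the 50-characters-per-page
threshold, and the "word-like token" pattern are all hypothetical choices
for illustration, not part of Solr or Tika.

```python
import re


def needs_ocr(extracted_text: str, page_count: int,
              min_chars_per_page: int = 50) -> bool:
    """Return True when the extracted text is too sparse to be a usable
    text layer, which suggests the PDF holds only scanned page images.

    The threshold is an assumed default; tune it for your own documents.
    """
    if page_count <= 0:
        raise ValueError("page_count must be positive")
    # Count only visible characters so blank padding does not inflate the total.
    visible = sum(1 for ch in extracted_text if not ch.isspace())
    return visible / page_count < min_chars_per_page


def wordlike_ratio(extracted_text: str) -> float:
    """Return the fraction of whitespace-separated tokens that look like
    plain ASCII alphabetic words (optionally followed by one punctuation mark).

    A low ratio hints at a low-accuracy OCR text layer. This pattern is a
    crude English-only heuristic, assumed for illustration.
    """
    tokens = extracted_text.split()
    if not tokens:
        return 0.0
    pattern = re.compile(r"[A-Za-z]{2,}[.,;:!?]?")
    wordlike = sum(1 for token in tokens if pattern.fullmatch(token))
    return wordlike / len(tokens)
```

Documents flagged by needs_ocr() would be candidates for an OCR pass before
indexing, and a low wordlike_ratio() would suggest that an existing text
layer is worth spot-checking by eye.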


-- Jack Krupansky

-----Original Message-----
From: Alexandre Rafalovitch

Sent: Tuesday, May 6, 2014 1:30 AM
To: solr-user@lucene.apache.org
Subject: Re: Indexing scanned PDFs

Nothing I am aware of for Solr directly. You may have better luck
asking on the Tika mailing list, since Tika is what Solr uses under
the covers to index PDFs. A quick search for Tika and OCR
brings up a number of links.

Regards,
 Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency



On Tue, May 6, 2014 at 12:15 PM, Chandan Tamrakar
chandan.tamra...@nepasoft.com wrote:

We are using Solr to index PDF documents, but in some cases the PDFs
are scanned documents with no text to extract and index.

Is there a plugin or module in Solr that we can integrate so that it would
extract the text via OCR and then index it?


Thanks in advance

Chandan Tamrakar 




Indexing scanned PDFs

2014-05-05 Thread Chandan Tamrakar
We are using Solr to index PDF documents, but in some cases the PDFs
are scanned documents with no text to extract and index.

Is there a plugin or module in Solr that we can integrate so that it would
extract the text via OCR and then index it?


Thanks in advance

Chandan Tamrakar


Re: Indexing scanned PDFs

2014-05-05 Thread Alexandre Rafalovitch
Nothing I am aware of for Solr directly. You may have better luck
asking on the Tika mailing list, since Tika is what Solr uses under
the covers to index PDFs. A quick search for Tika and OCR
brings up a number of links.

Regards,
  Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency


On Tue, May 6, 2014 at 12:15 PM, Chandan Tamrakar
chandan.tamra...@nepasoft.com wrote:
 We are using Solr to index PDF documents, but in some cases the PDFs
 are scanned documents with no text to extract and index.

 Is there a plugin or module in Solr that we can integrate so that it would
 extract the text via OCR and then index it?


 Thanks in advance

 Chandan Tamrakar