Re: Indexing speed reduced significantly with OCR

Walter Underwood Thu, 30 Mar 2017 08:07:45 -0700

As I said before, this is a great application for pay-as-needed cloud servers.


Netflix’s first use of Amazon EC2 was encoding movies for different screen 
sizes, data rates, codecs, and DRM. They would fire up a hundred or a thousand 
instances, feed movies to them, pick up the encodes, then release the 
instances. 

Ten years later, Amazon offers a service to do that (Elastic Transcoder): 
https://aws.amazon.com/elastictranscoder/

Here is an example of setting up Tesseract OCR on AWS Lambda, which is how I
would do it, both for OCR and for PDF processing.

http://stackoverflow.com/questions/33588262/tesseract-ocr-on-aws-lambda-via-virtualenv/35724894#35724894
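
To give a rough idea of the shape of such a function, here is a minimal
sketch, not the code from the linked answer. It assumes the function is
triggered by S3 upload events, that a tesseract binary is bundled with the
deployment package and is on the PATH, and that OCR text goes to a separate
bucket. OUTPUT_BUCKET and the helper names are hypothetical.

```python
import subprocess
import urllib.parse
from pathlib import Path

OUTPUT_BUCKET = "my-ocr-text"  # hypothetical bucket for the OCR output


def s3_objects(event):
    """Return (bucket, key) pairs from an S3 event notification."""
    pairs = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Keys arrive URL-encoded in S3 events ("+" stands for a space).
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        pairs.append((bucket, key))
    return pairs


def tesseract_command(image_path, output_base):
    """Build the CLI call; tesseract writes its text to <output_base>.txt."""
    return ["tesseract", str(image_path), str(output_base)]


def handler(event, context):
    """Lambda entry point: OCR each uploaded image and store the text."""
    import boto3  # imported here so the helpers above work without the AWS SDK

    s3 = boto3.client("s3")
    for bucket, key in s3_objects(event):
        # /tmp is the writable scratch space inside a Lambda function.
        image_path = Path("/tmp") / Path(key).name
        output_base = image_path.with_suffix("")
        s3.download_file(bucket, key, str(image_path))
        subprocess.run(tesseract_command(image_path, output_base), check=True)
        s3.upload_file(f"{output_base}.txt", OUTPUT_BUCKET, f"{key}.txt")
```

Each upload becomes its own invocation, so a large backlog is OCRed in
parallel and nothing is billed once the queue is empty.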

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Mar 30, 2017, at 5:50 AM, Allison, Timothy B. <talli...@mitre.org> wrote:
> 
>> Note that the OCRing is a separate task from Solr indexing, and is best done 
>> on separate machines.
> 
> +1
> 
> -----Original Message-----
> From: Rick Leir [mailto:rl...@leirtech.com] 
> Sent: Thursday, March 30, 2017 7:37 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Indexing speed reduced significantly with OCR
> 
> The workflow is
> -/ OCR new documents
> -/ check quality and tune until you get good output text
> -/ keep the output text in the file system
> -/ index and re-index to Solr as necessary from the file system
> 
> 
> Note that the OCRing is a separate task from Solr indexing, and is best done 
> on separate machines. I used all the old 'surplus' servers for OCR.
> Cheers -- Rick
> --
> Sent from my Android device with K-9 Mail. Please excuse my brevity.
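
The last step of the workflow quoted above (index and re-index to Solr from
the saved text files) could look roughly like the sketch below. It is only an
illustration under assumptions: the core name "docs", the field names "id"
and "text", and both function names are hypothetical, and the OCR output is
assumed to be one UTF-8 .txt file per document.

```python
import json
import urllib.request
from pathlib import Path

# Hypothetical Solr core; adjust host, port, and core name to your setup.
SOLR_UPDATE_URL = "http://localhost:8983/solr/docs/update?commit=true"


def docs_from_text_dir(text_dir):
    """Turn each saved OCR text file into a Solr document."""
    docs = []
    for path in sorted(Path(text_dir).glob("*.txt")):
        docs.append({
            "id": path.stem,  # file name without extension as the unique key
            "text": path.read_text(encoding="utf-8"),  # assumed field name
        })
    return docs


def index_docs(docs, url=SOLR_UPDATE_URL):
    """POST the documents as a JSON array to Solr's update handler."""
    request = urllib.request.Request(
        url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

Because the OCR text stays on disk, calling
index_docs(docs_from_text_dir("ocr-output")) again after a schema change
re-indexes everything without re-running OCR.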
