As I said before, this is a great application for pay-as-needed cloud servers.
Netflix’s first use of Amazon EC2 was encoding movies for different screen sizes, data rates, codecs, and DRM. They would fire up a hundred or a thousand instances, feed movies to them, pick up the encodes, then release the instances.

Ten years later, Amazon offers a service to do that (Elastic Transcoder):

https://aws.amazon.com/elastictranscoder/

Here is an example of configuring OCR using AWS Lambda, which is how I would do it, for both OCR and PDF.

http://stackoverflow.com/questions/33588262/tesseract-ocr-on-aws-lambda-via-virtualenv/35724894#35724894

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Mar 30, 2017, at 5:50 AM, Allison, Timothy B. <talli...@mitre.org> wrote:
>
>> Note that the OCRing is a separate task from Solr indexing, and is best done
>> on separate machines.
>
> +1
>
> -----Original Message-----
> From: Rick Leir [mailto:rl...@leirtech.com]
> Sent: Thursday, March 30, 2017 7:37 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Indexing speed reduced significantly with OCR
>
> The workflow is
> -/ OCR new documents
> -/ check quality and tune until you get good output text
> -/ keep the output text in the file system
> -/ index and re-index to Solr as necessary from the file system
>
> Note that the OCRing is a separate task from Solr indexing, and is best done
> on separate machines. I used all the old 'surplus' servers for OCR.
> Cheers -- Rick
> --
> Sent from my Android device with K-9 Mail. Please excuse my brevity.
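
For what it's worth, the workflow Rick describes in the quoted message (OCR once, keep the text on disk, index and re-index from there) might be sketched roughly like this in Python. This is only an illustration under assumptions: the function names, the ".txt" naming scheme, and the "id" and "text" field names are made up for the example, and the OCR engine is passed in as a plain callable rather than tied to any particular tool.

```python
from pathlib import Path
from typing import Callable, Dict, List


def ocr_new_documents(source_dir: Path, text_dir: Path,
                      ocr: Callable[[Path], str]) -> List[Path]:
    """Run OCR only on documents that have no saved text yet.

    `ocr` is any function that takes a document path and returns its text
    (hypothetical here; plug in whatever OCR engine you use).
    Returns the text files written during this run.
    """
    text_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for doc in sorted(source_dir.iterdir()):
        if not doc.is_file():
            continue
        out = text_dir / (doc.name + ".txt")  # assumed naming scheme
        if out.exists():
            continue  # already OCRed; never repeat the expensive step
        out.write_text(ocr(doc), encoding="utf-8")
        written.append(out)
    return written


def build_solr_docs(text_dir: Path) -> List[Dict[str, str]]:
    """Build index documents from the saved text, without touching OCR.

    The "id" and "text" field names are assumptions; use your schema's names.
    """
    docs = []
    for txt in sorted(text_dir.glob("*.txt")):
        docs.append({
            "id": txt.stem,  # original file name, e.g. "scan1.pdf"
            "text": txt.read_text(encoding="utf-8"),
        })
    return docs
```

The point of splitting it this way is that the first function can run on separate (or rented) machines and be re-run safely, while the second only reads text files, so re-indexing never triggers OCR again. The list it returns could then be serialized as JSON and sent to Solr's update handler, or fed to whatever indexing client you already use.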