Tesseract is going to be slow, and there might not be much you can do about
that.
You can do a couple of things, like set up processes that run on AWS EC2
spot instances: you put a standing bid on AWS instances and only run your
OCR when the price drops.
Or you can buy ABBYY, which
Art Rhyno talked about doing this with scans of old community newspapers
a few years ago (https://www.youtube.com/watch?v=gcjCiS9pJ3A)
Yes, it's very compute-intensive and slow. He set up Hadoop to farm jobs
out to the PCs in the library's public lab while the library was closed
at night.
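
For what it's worth, the same farm-it-out idea can be tried on one box
before reaching for Hadoop. This is only a rough sketch, not what Art
actually ran: it swaps Hadoop for a local process pool, and it assumes
ImageMagick's `convert` and `tesseract` are both on the PATH. The function
names (build_commands, ocr_one, ocr_many), the 300 dpi density and the
worker count are all made up for illustration.

```python
import subprocess
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def build_commands(pdf_path, work_dir):
    """Return the two argv lists for one PDF: PDF -> TIFF, then TIFF -> text."""
    pdf = Path(pdf_path)
    tiff = Path(work_dir) / (pdf.stem + ".tiff")
    out_base = Path(work_dir) / pdf.stem  # tesseract appends ".txt" itself
    # Assumed settings: 300 dpi, 8-bit depth. Tune for your scans.
    convert_cmd = ["convert", "-density", "300", str(pdf), "-depth", "8", str(tiff)]
    ocr_cmd = ["tesseract", str(tiff), str(out_base)]
    return [convert_cmd, ocr_cmd]


def ocr_one(pdf_path, work_dir):
    """Run both steps for a single PDF and return the path of the text file."""
    for cmd in build_commands(pdf_path, work_dir):
        subprocess.run(cmd, check=True)  # raises if either tool fails
    return str(Path(work_dir) / (Path(pdf_path).stem + ".txt"))


def ocr_many(pdf_paths, work_dir, workers=4):
    """Spread the PDFs over a pool of worker processes, one PDF per task."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(ocr_one, pdf_paths, [work_dir] * len(pdf_paths)))
```

Calling something like ocr_many(list_of_pdfs, "out", workers=8) from under
an `if __name__ == "__main__":` guard should keep all the cores busy. It
won't make any single page faster, it just stops the jobs from queuing up
behind each other.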
-
Howdy all,
I've just started a project that involves harvesting large numbers of
scanned PDFs and extracting information from the OCR output text.
The process I've started with -- use ImageMagick to convert to TIFF and
Tesseract to pull out the OCR -- is more system intensive than I
On 2014-12-09 14:25, Kyle Banerjee wrote: