Re: [CODE4LIB] Scanned PDF to text
Tesseract is going to be slow, and there might not be much you can do about that. You can do a couple of things, like set up processes that run on AWS EC2 spot instances, so you can put a standing bid order on AWS instances and only run your OCR when the price drops. Or you can buy ABBYY, which is much faster.

b,chris.

On Tue, Dec 9, 2014 at 5:45 PM, Kyle Banerjee kyle.baner...@gmail.com wrote:

>> I’m not quite sure if I understand the question, but if all you want to do is pull the text out of an OCR’ed PDF file, then I have found both Tika and PDFtotext to be useful tools.
>>
>> On the other hand, if you need to do the OCR itself, then employing Tesseract is probably the way to go.
>
> For clarity, I have to do the OCR itself. I've been using CAM::PDF to extract existing text.
>
> Kyle
Re: [CODE4LIB] Scanned PDF to text
Art Rhyno talked about doing this with scans of old community newspapers a few years ago (https://www.youtube.com/watch?v=gcjCiS9pJ3A).

Yes, it's very compute intensive and slow. He set up Hadoop to farm jobs out to the PCs in the library's public lab while the library was closed at night.

- David

On 2014/12/11 03:59, Chris Fitzpatrick wrote:

> Tesseract is going to be slow, and there might not be much you can do about that. You can do a couple of things, like set up processes that run on AWS EC2 spot instances, so you can put a standing bid order on AWS instances and only run your OCR when the price drops. Or you can buy ABBYY, which is much faster.
>
> b,chris.
[CODE4LIB] Scanned PDF to text
Howdy all,

I've just started a project that involves harvesting large numbers of scanned PDFs and extracting information from the text of the OCR output. The process I've started with -- use imagemagick to convert to tiff and tesseract to pull out the OCR -- is more system intensive than I hoped it would be.

Is there an easier/faster process that I'm missing? Perl friendly solutions are preferred because this fits in as part of a larger process.

If I am already using my best option, what kind of image parameters are recommended if I want to hit the point of diminishing returns but not necessarily go for the best possible?

Thanks,

kyle
Re: [CODE4LIB] Scanned PDF to text
On 2014-12-09 14:25, Kyle Banerjee wrote:

> Howdy all,
>
> I've just started a project that involves harvesting large numbers of scanned PDFs and extracting information from the text of the OCR output. The process I've started with -- use imagemagick to convert to tiff and tesseract to pull out the OCR -- is more system intensive than I hoped it would be.

I asked around the office and the process seems sensible overall. One suggestion was to use pdfimages instead of imagemagick, as that should be faster. However, I would guess that most of the processing time is actually spent in tesseract, so I don't know how much this suggestion will improve the overall performance.

Regards.

--
Mads Villadsen m...@statsbiblioteket.dk
Statsbiblioteket
It-udvikler
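
The pdfimages suggestion above could be sketched in Perl roughly as follows. This is only an illustration, not anyone's tested pipeline: it assumes pdfimages and tesseract are on the PATH, that each page of the scanned PDF holds a single embedded image, and that tesseract can read the PPM/PBM files pdfimages writes by default. The helper names (pdfimages_cmd, tesseract_cmd, ocr_pdf) are made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Argument list for pdfimages: dump every embedded image as
# <root>-NNN.ppm or <root>-NNN.pbm (the tool's default formats).
sub pdfimages_cmd {
    my ($pdf, $root) = @_;
    return ('pdfimages', $pdf, $root);
}

# Argument list for tesseract: OCR one image and write <base>.txt.
sub tesseract_cmd {
    my ($image, $base) = @_;
    return ('tesseract', $image, $base);
}

# Extract the page images from one PDF, OCR each, return all the text.
# (Hypothetical helper; assumes one embedded image per page.)
sub ocr_pdf {
    my ($pdf) = @_;
    my $dir = tempdir(CLEANUP => 1);

    system(pdfimages_cmd($pdf, "$dir/page")) == 0
        or die "pdfimages failed for $pdf\n";

    my $text = '';
    for my $image (sort glob("$dir/page-*")) {
        (my $base = $image) =~ s/\.[^.]+$//;    # strip the extension
        system(tesseract_cmd($image, $base)) == 0
            or die "tesseract failed for $image\n";

        open my $fh, '<', "$base.txt" or die "cannot read $base.txt: $!\n";
        local $/;                               # slurp the whole file
        $text .= <$fh>;
        close $fh;
    }
    return $text;
}

# Usage: perl ocr_pdf.pl scanned.pdf > scanned.txt
print ocr_pdf($ARGV[0]) if @ARGV;
```

Because pdfimages copies the embedded scans out instead of re-rendering each page, it skips the rasterizing step that imagemagick performs. As noted above, though, tesseract itself is likely still where most of the time goes.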
Re: [CODE4LIB] Scanned PDF to text
On Dec 9, 2014, at 8:25 AM, Kyle Banerjee kyle.baner...@gmail.com wrote:

> I've just started a project that involves harvesting large numbers of scanned PDFs and extracting information from the text of the OCR output. The process I've started with -- use imagemagick to convert to tiff and tesseract to pull out the OCR -- is more system intensive than I hoped it would be.

I’m not quite sure if I understand the question, but if all you want to do is pull the text out of an OCR’ed PDF file, then I have found both Tika and PDFtotext to be useful tools. [1, 2] Here’s a Perl script that takes a PDF as input and uses Tika to output the OCR’ed text:

    #!/usr/bin/perl

    # configure
    use constant TIKA => 'java -jar tika.jar -T ';

    # require
    use strict;

    # initialize; needs sanity checking
    my $cmd = TIKA . $ARGV[ 0 ];

    # do the work; Tika writes the text to STDOUT itself
    system $cmd;

    # done
    exit;

Tika can run in a server mode, making it more efficient for extracting the text from multiple files.

On the other hand, if you need to do the OCR itself, then employing Tesseract is probably the way to go.

[1] Tika - http://tika.apache.org
[2] PDFtoText - http://www.foolabs.com/xpdf/download.html

-- ELM
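
For the server mode mentioned above, a rough Perl sketch might look like the following. It assumes a Tika server has already been started separately (for example with the tika-server jar) and is listening on localhost port 9998, and that it accepts a PUT to /tika and returns plain text. The URL and the helper names (tika_request, extract_text) are assumptions for this example. Like the script above, this only pulls out text that is already in the PDF; it does not do the OCR.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Tiny;

# Assumed endpoint of a separately started Tika server.
use constant TIKA_URL => 'http://localhost:9998/tika';

# Options for a PUT request carrying the raw PDF bytes and asking
# for plain text back.
sub tika_request {
    my ($bytes) = @_;
    return {
        headers => {
            'Accept'       => 'text/plain',
            'Content-Type' => 'application/pdf',
        },
        content => $bytes,
    };
}

# Send one file to the server and return the extracted text.
sub extract_text {
    my ($http, $file) = @_;

    open my $fh, '<:raw', $file or die "cannot read $file: $!\n";
    my $bytes = do { local $/; <$fh> };
    close $fh;

    my $res = $http->request('PUT', TIKA_URL, tika_request($bytes));
    die "Tika returned $res->{status} $res->{reason} for $file\n"
        unless $res->{success};
    return $res->{content};
}

# Usage: perl tika_client.pl one.pdf two.pdf ...
# One HTTP::Tiny object is shared across all the files.
if (@ARGV) {
    my $http = HTTP::Tiny->new;
    print extract_text($http, $_) for @ARGV;
}
```

The saving over the one-shot script is that the JVM starts once for the server, not once per PDF.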