Re: [CODE4LIB] OCR PDFs

Binkley, Peter Fri, 17 Oct 2008 11:41:25 -0700

And beyond Tesseract is Ocropus (http://code.google.com/p/ocropus/),
which uses Tesseract (and eventually other OCR engines) to generate
positional OCR in an HTML format. I wonder if you could process that
HTML slightly to put the TIFF in the background, then use an HTML to PDF
tool to generate your final PDF. Or something like that. Googling
"ocropus pdf" finds a few projects and discussions that might be
helpful.
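
To make the "process that HTML slightly" step concrete, here is a rough
sketch in Python. It is not taken from Ocropus or any existing project. It
assumes the output follows hOCR-style conventions: a page element with class
ocr_page, line elements with class ocr_line, and a title attribute holding
"bbox x0 y0 x1 y1" in pixels. The function names and the page.png file name
are hypothetical, and the scan is assumed to have been converted from TIFF
to a format the HTML-to-PDF tool can render.

```python
import re

# Opening tag with a quoted title attribute; group 2 is the title text.
# Assumes each tag sits on one line and has no style attribute yet.
TAG_RE = re.compile(r"""<\w+\b[^>]*?\btitle=(['"])(.*?)\1[^>]*>""")
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def style_for(tag, title, image_url):
    """Return CSS for a page or line tag, or '' for anything else."""
    m = BBOX_RE.search(title)
    if m is None:
        return ""
    x0, y0, x1, y1 = map(int, m.groups())
    size = f"width:{x1 - x0}px;height:{y1 - y0}px;"
    if "ocr_page" in tag:
        # The page becomes the positioning container and carries the scan.
        return f"position:relative;{size}background:url('{image_url}') no-repeat;"
    if "ocr_line" in tag:
        # Only lines are positioned, so nested boxes do not offset each other.
        return f"position:absolute;left:{x0}px;top:{y0}px;{size}"
    return ""


def overlay_on_image(html, image_url):
    """Add inline styles so each text line sits over the page image."""
    def restyle(match):
        tag = match.group(0)
        css = style_for(tag, match.group(2), image_url)
        return f'{tag[:-1]} style="{css}">' if css else tag
    return TAG_RE.sub(restyle, html)


if __name__ == "__main__":
    sample = ("<div class='ocr_page' title='image \"page.png\"; bbox 0 0 100 200'>"
              "<span class='ocr_line' title='bbox 10 20 60 40'>hello</span></div>")
    print(overlay_on_image(sample, "page.png"))
```

The result could then be handed to whatever HTML-to-PDF tool you settle on.
Whether the text should also be made invisible so that only the scan shows
is a separate choice this sketch leaves open.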


Peter 

> -----Original Message-----
> From: Code for Libraries [mailto:[EMAIL PROTECTED] On 
> Behalf Of Bridger Dyson-Smith
> Sent: Friday, October 17, 2008 6:56 AM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] OCR PDFs
> 
> If you haven't already, take a look at tesseract ( 
> http://code.google.com/p/tesseract-ocr/). There's some 
> discussion of using tesseract and shell scripting to work 
> with tiffs to pdfs to ocr'd text, which isn't exactly what 
> you're wanting to do, I know, but may prove helpful 
> (http://www.groklaw.net/articlebasic.php?story=20061210115516438).
> Cheers!
> Bridger Dyson-Smith
> 
> 
> On Fri, Oct 17, 2008 at 8:28 AM, Terry Harrison 
> <[EMAIL PROTECTED]> wrote:
> 
> > You might want to look at ABBYY Fine Reader 9.0 Professional, which
> > can be driven from the command line.  Fine Reader is used at the
> > Library of Congress.  Here is an info link to get you started
> > (search "command"):
> >
> >
> > 
> > http://www.scanstore.com/Scanning/Document_Imaging/Software/OCR_Software/Nuance/omnipage_review.asp
> >
> > Regards,
> > Terry
> >
> > ------------------------------------
> > Terry Harrison
> > Project Manager
> > CACI
> > 5505 Robin Hood Road, Suite F
> > Norfolk, Va. 23508
> > Ph: 757.321.9120 x232
> > Fax: 757.321.8797
> > [EMAIL PROTECTED]
> >
> 
>
