Thanks, everyone, for your ideas and suggestions! Some had occurred to us but were discarded because we feel our solution needs to be automated -- 45 million pages is a lot to thrust on any human-driven effort.
I like Itamar's idea of doing "competing" OCR and keeping the best result. Unfortunately, OCR software is far from cheap, and the cost of two different product licenses may be too high for the project. I've also looked into Tesseract/OCRopus, but while the ideas are good, it isn't there yet.

> On Jan 25, 2008 6:12 AM, mark harwood <[EMAIL PROTECTED]> wrote:
>
>> Probably not a practical solution for you to set up but I love this
>> idea:
>> http://blog.wired.com/monkeybites/2007/05/recaptcha_fight.html