On Sunday 18 March 2007 08:07:40 am Anders Norrbring wrote:
>
> Thanks, I'm not locked to Linux for this adventure, but the images are
> stored on a Linux system but can be accessed from Windows.
>
> Maybe I should explain more what I'm about to do, it's not really a full
> text scan..
> The images are from a dozen or so different photographers, who all put
> their copyright notice in text on every image. What I want to accomplish
> is to categorize them all according to who took the picture, in other
> words, sort them by photographer name. So, the OCR should only read one
> or two words out of a maximum of 4-6 words somewhere in the image.
> It's also a one-time thing to do, so I cannot motivate a license cost
> for a fully fledged OCR suite.
>
> I'll take a look at the links you provided, thanks!

Well, I'm only half-joking when I say you might be better off just hiring a few people to do the data entry by hand. OCR really only lends itself to higher-volume work that is repeated often. It has high setup costs, and it takes a huge amount of time to get the confidence levels to anything above 80%.

I have staff members who spend a majority of their time simply tweaking and refining OCR templates to ensure the scanned forms get above 95% accuracy. (3 million documents at 4+ fields per document comes to roughly 12,000,000 OCR fields per year. 95% accuracy means that 5% of those, about 600,000 fields, are still incorrect, which translates to a huge labor cost.)

In any case, I wish you luck.

That all said, I just checked with SMART and found two OCR items we have available.

1. GOCR is a free OCR (Optical Character Recognition) project that provides a library, a command line version, and an X interface. Although the program is in an early development state, the results are very impressive.

2. GNU Ocrad is an OCR (Optical Character Recognition) program implemented as a filter and based on a feature extraction method. It reads a bitmap image in PBM format and outputs text in the ISO-8859-1 (Latin-1) charset. It can be used as a stand-alone console application or as a back-end to other programs.

Both can be plugged into Kooka, the KDE scan and OCR program.

--
k
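
P.S. Since the job is only to pick one of a dozen or so known names out of a short copyright line, the OCR output does not have to be perfect. Below is a rough Python sketch, not a tested workflow, of matching noisy OCR text against a fixed list of names using the standard library's difflib. The names in PHOTOGRAPHERS are made up, the 0.6 cutoff is a guess you would need to tune, and the function assumes you have already captured the text that gocr or Ocrad printed for an image.

```python
import difflib

# Hypothetical names for illustration; replace with the real photographers.
PHOTOGRAPHERS = ["Anna Berg", "Erik Lund", "Maria Holm"]


def guess_photographer(ocr_text, names=PHOTOGRAPHERS, cutoff=0.6):
    """Return the known name most similar to some run of words in ocr_text.

    Returns None when no name scores above the cutoff (an assumed threshold).
    """
    words = ocr_text.split()
    best_name, best_score = None, cutoff
    for name in names:
        n = len(name.split())
        # Slide a window with as many words as the name across the OCR output.
        for i in range(max(len(words) - n + 1, 1)):
            candidate = " ".join(words[i:i + n]).lower()
            score = difflib.SequenceMatcher(None, candidate, name.lower()).ratio()
            if score > best_score:
                best_name, best_score = name, score
    return best_name


if __name__ == "__main__":
    # "Er1k" stands in for a typical OCR misread of "Erik".
    print(guess_photographer("Copyright 2006 Er1k Lund"))
```

Anything that comes back as None would go into a pile to sort by hand, which for a one-time job may still be less work than tuning the OCR itself.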