I downloaded Walter Farquhar Hook's 1842 Church Dictionary <http://www.archive.org/details/ChurchDictionary> from the Internet Archive and tried OCRing some text from it using free software. I didn't have much success, but success looks tantalizingly close.
I used DjView to extract the first page that has actual text on it.

gOCR
----

gOCR renders the first four lines of the sample book, as output by DjView, more or less as follows:

    __E stronges_ __ecommendation o_ the _olIo_-
    @g _orb [EMAIL PROTECTED] @@ the statemen_ {_f @s [EMAIL PROTECTED]
    _oR the __ost paRt, me,Rely a [EMAIL PROTECTED]@n; a__d tb@
    _eneraI ac___o_ledgment rendel_s @ unnecessarY

It actually reads:

    THE strongest recommendation of the follow-
    ing Work consists in the statement of its being,
    for the most part, merely a Compilation; and this
    general acknowledgment renders it unnecessary

A second try, using the command line

    gocr -C '- abcdefghijklmnopqrstuvwxyz,;ABCDEFGHIJKLMNOPQRSTUVWXYZ.' ChurchDictionary0004.pbm

yielded the following results after 100 seconds of CPU time:

    __E stronges_ __ecommendation o_ the _olIo__
    \code(011d)ng _orb consìsts ìn the statemen_ i_f ìts beîngg
    _oR the __ost paRt, me,Rely a Compìlatìon; a__d tbìs
    _eneraI ac___o_ledgment rendel_s Ãt unnecessarY

Ocrad
-----

(I don't remember what version of Ocrad this was --- probably 0.12, but definitely not 0.13.) Ocrad took only 19 seconds and produced the following results:

    rHE strongest _.ecommen_ation of _he fo_lo__
    ing Work consists iA the staLe_ent l_f its being,
    for the _oost yart, _oe,rely a Co__iI_tion; al_d
    _his gen_ral achl_o_ledg_oent rende__s ié unnecessary

Upon being told that it was trying to recognize ASCII (-c ascii), it produced:

    rHE strongest _.ecommen_ation of _he fo_lo__
    ing Work consists iA the staLe_ent l_f its being,
    for the _oost yart, _oe,rely a Co__iI_tion; al_d
    _his gen_ral achl_o_ledg_oent rende__s it unnecessary

That's nearly good enough to be corrected with a dictionary.
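As an illustration of how far such a dictionary pass could go, here is a minimal sketch (my own, not anything gOCR or Ocrad provides) that treats each character the recognizers rejected as a single-letter wildcard and accepts a correction only when exactly one dictionary word matches:

```python
import re

# Tiny stand-in for a real word list such as /usr/share/dict/words.
WORDS = ["the", "strongest", "recommendation", "of", "follow", "ing",
         "work", "consists", "in", "statement", "its", "being", "for",
         "most", "part", "merely", "a", "compilation", "and", "this",
         "general", "acknowledgment", "renders", "it", "unnecessary"]

def correct(token):
    """Turn every non-letter character into a one-letter wildcard and
    return the unique dictionary match, or the token unchanged."""
    pattern = re.sub(r"[^a-z]", ".", token.lower())
    matches = [w for w in WORDS if re.fullmatch(pattern, w)]
    return matches[0] if len(matches) == 1 else token

# Some of gOCR's output from above:
print(correct("statemen_"))   # -> statement
print(correct("stronges_"))   # -> strongest
print(correct("_oR"))         # -> for
```

Badly damaged tokens like "ac___o_ledgment", where the segmenter also got the letter count wrong, would still slip through; those need approximate matching rather than per-character wildcards.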
gOCR did better on the second "the", "in", "statement", "part", "merely", "compilation", "and", "this", and "acknowledgment" --- 9 of the 29 words --- while Ocrad did better on the first "the", "strongest", "of", "Work", "consists", "its", "being,", "for", "the", "this", and "unnecessary" --- 11 of the 29.

ClaraOCR
--------

It took me a long time to find ClaraOCR, because its web site has been stolen. At first I couldn't figure out how to get ClaraOCR to produce OCR output at all; the answer is to train it, iterating the training until you get acceptable output. This is a slow process, and it doesn't produce good results even after considerable effort (the recognizer is kind of dumb, and there's apparently no way to correct the segmentation into symbols), but it seems that it can ultimately produce better results than the alternative methods. I was eventually able to train it to get the first few lines mostly correct:

    PRE[0]FACE.
    T HE strongest r ecommendation of the follow-
    ing Work consists in t he statement of its being,
    for the most part, merely a Compi, an[104]d th lation . is
    g eneral acknowledgment renders it unnecessary

The next four lines looked like this:

    [216]o meri[227]ro[204] [229]he [231]a[219]io[201]is so[207][208]ce[210][194] ro[214] [215]h i[202]h
    it as ee[245] [258]o[246] r ed [262][269]ra[264]ts ia[235][265]. [266]e[253] o teii
    [240]ade almost [293]or o[296] [297]or roiri so[305]e of o[281]i[283]
    eates[278] r[320]irie[331][318], a[341]d t e [352]o[354] i er as ee[322] sorrre[350]i[336][319][351][337] ce[325]

They actually read as follows:

    to mention the various sources from which
    it has been compiled. Extracts have been often
    made almost word for word from some of our
    greatest Divines, and the compiler has been sometimes cen-

Slightly better segmentation and a dictionary would help considerably here. "meri.ro." would probably be "men.io."
with better segmentation, and /usr/share/dict/words has only one possibility for that; likewise ".a.io.is" occurs only as "raviolis", and would probably be ".a.ious" with better segmentation --- yielding the possibilities "carious", "sanious", and "various". My frequency analysis of the British National Corpus <http://pobox.com/~kragen/sw/wordlist> found "various" 15503 times, "carious" 7 times, and "sanious" and "raviolis" fewer than 5 times each, so it should be pretty easy to pick the right one.

This is after 140 cycles of recognition, which took nearly an hour.

ClaraOCR also includes some code for cooperative web-based OCR, but I haven't tried it yet.

ClaraOCR has a user interface that makes its OCR process dramatically more transparent than gOCR's and Ocrad's, so I feel better about it than the above miserable performance would lead one to expect.

ocre
----

ocre wouldn't work on the ASCII PBM, claiming it wasn't a PBM; it accepted only the binary PBM. Eventually it popped up a bunch of windows asking for help with letters it was having trouble with. It seemed to be having trouble with a lot of letters, so I eventually gave up.

DjVu
----

DjVu isn't in the same category as ocre, ClaraOCR, gOCR, and Ocrad, but it's important. The DjVuLibre tools provide powerful free-software compression algorithms for handling page images, particularly bilevel page images, and DjView is a far better document reader than xpdf, ghostview, gv, or xdvi, because it's usually instantly responsive, supports copy and paste, and supports text search.

DjVu doesn't presently have a good free encoder for scanned data. Presumably that will eventually change, once it becomes important. In the meantime, DjVu will be important for several reasons:

- The Internet Archive is releasing a large volume of public-domain page images in DjVu format as part of its Million Books Project, but presently has no OCR data for them.

- DjVu files can take advantage of OCR output for copying and full-text search.
  The free-software "djvused" command can add OCR output to existing DjVu files without re-encoding the images.

- DjVu files may be a useful format for acquiring training and evaluation data for OCR programs, because unlike other widely-used file formats, they contain both the raw (possibly compressed) page images and "ground truth" OCR output. Consequently the adoption of DjVu will make OCR training and evaluation data much easier to come by than in the past.

Directions for improvement
--------------------------

Better UI: ClaraOCR has spent the largest amount of work on its user interface, but despite all that, it's far from obvious how to get any output at all; entering the correct transliteration for a single five-letter word requires five mouse clicks or five arrow-key presses, and classifying a "symbol" that its segmenter has discovered as "noise" requires many more mouse clicks. Most of the improvements suggested below could also be used to dramatically speed up the training process.

Dictionaries: all 26 of the distinct words in the text segment I tested with occur in my word list mentioned above; the least frequent are "ing", with 123 occurrences, and "acknowledgment", with 96 (the modern spelling "acknowledgement" has 554). (The total of all frequencies therein is 90080933, a little over 90 million, but that doesn't include the words that occurred fewer than five times.) This suggests that a little bit of fixing up based on language-specific word frequencies could help a lot.

Combination: it is at least possible that, for example, gOCR and Ocrad together could produce better results than either alone, since each excelled the other on certain words.
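The combination idea is easy to prototype. The sketch below is entirely my own invention, with a deliberately crude scoring rule: pick, word by word, whichever of two OCR readings of the same line contains fewer rejected characters.

```python
def score(token):
    """Crude confidence: the fraction of characters that are plain
    ASCII letters (rejection marks like '_' and '@' drag it down)."""
    if not token:
        return 0.0
    return sum(c.isascii() and c.isalpha() for c in token) / len(token)

def combine(line_a, line_b):
    """Word-by-word vote between two OCR readings of the same line."""
    return " ".join(a if score(a) >= score(b) else b
                    for a, b in zip(line_a.split(), line_b.split()))

# The first line of the sample, as read by gOCR and by Ocrad;
# the vote recovers "strongest", "of", and "the" from Ocrad and
# the less-damaged "__ecommendation" from gOCR.
print(combine("__E stronges_ __ecommendation o_ the _olIo__",
              "rHE strongest _.ecommen_ation of _he fo_lo__"))
```

A real combiner would also have to align the two outputs when the recognizers disagree about word boundaries, which this word-for-word zip ignores.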
Ambiguous output: if the text output is to be used to reformat the scanned text (for example, for columns of a different width, for a device with less storage space, or for a text reader), it is obviously essential to choose the single most likely reading of the text; but if OCR is being used merely to make a set of images searchable, "f?ro(m|iri)" is a perfectly good transliteration of "from", which ClaraOCR rendered above as "roiri".

Better algorithms: the text I was OCRing is eminently readable to human eyes, despite being slightly askew and printed with somewhat worn and dented type. It's absurd that gOCR could correctly recognize only "a" and a couple of instances of "the" in the original text; that Ocrad got only 11 of the 29 words correct, even without a dictionary; and that after an hour of training, ClaraOCR had a similar success rate to (untrained!) gOCR on the next four lines.

Better evaluation data: a standard OCR corpus for evaluating and training software would probably help a lot. There's the ISRI OCRtk OCR Performance Toolkit, which contains "a large and diverse corpus of 280 scanned page images with corresponding ground-truth text," but it's not clear whether it's free software.
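The ambiguous-output scheme above can be made concrete: store each recognized word as a small regular expression over its plausible readings, and let search match the query against those expressions. In this sketch only the pattern "f?ro(m|iri)" comes from the text; the other patterns and the function are my own illustration.

```python
import re

# One pattern per recognized word on a page; "f?ro(m|iri)" is the
# example from the text, the other two are made up in the same spirit.
page_words = ["f?ro(m|iri)", "s[o0]me", "w[o0]r[dk]"]

def search(query, words):
    """Return the ambiguous transliterations the query could match."""
    return [pat for pat in words if re.fullmatch(pat, query)]

print(search("from", page_words))   # matches only the first pattern
```

Note that the pattern matches both the true reading "from" and ClaraOCR's actual output "roiri", which is exactly what makes it an acceptable transliteration for search purposes.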