Hello all,

We have been indexing a large collection of OCR'd text: about 5 million books in over 200 languages, totaling roughly 1.5 billion OCR'd pages. At that scale, even a small OCR error rate creates a relatively large number of meaningless unique terms. (See http://www.hathitrust.org/blogs/large-scale-search/too-many-words)
We would like to remove some *fraction* of these nonsense words caused by OCR errors prior to indexing. (We don't want to remove "real" words, so we need a method with very few false positives.)

A dictionary-based approach does not seem feasible given the number of languages and the inclusion of proper names, place names, and technical terms. We are considering heuristics such as looking for strings over a certain length or strings containing more than some number of punctuation characters. This paper describes a few such heuristics:

Kazem Taghva, Tom Nartker, Allen Condit, and Julie Borsack. Automatic Removal of "Garbage Strings" in OCR Text: An Implementation. In The 5th World Multi-Conference on Systemics, Cybernetics and Informatics, Orlando, Florida, July 2001. http://www.isri.unlv.edu/publications/isripub/Taghva01b.pdf

Can anyone suggest practical approaches to removing some fraction of the tokens containing OCR errors from our input stream?

Tom Burton-West
University of Michigan Library
www.hathitrust.org
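
P.S. For concreteness, here is a rough sketch of the kind of token-level filter we are considering, assuming a pre-processing pass over the token stream before it reaches the indexer. The threshold values are illustrative placeholders rather than anything we have settled on, and the repeated-character rule is borrowed loosely from the spirit of the Taghva et al. rules:

import re
import unicodedata

# Illustrative thresholds -- placeholders, not settled values.
MAX_TOKEN_LENGTH = 30    # reject strings over a certain length
MAX_PUNCTUATION = 2      # reject strings with too many punctuation characters
MAX_REPEAT_RUN = 4       # reject long runs of the same character (e.g. "iiiii")

def is_garbage(token):
    """Return True if a token trips one of the simple garbage heuristics."""
    if len(token) > MAX_TOKEN_LENGTH:
        return True
    # Count Unicode punctuation (general category starting with 'P').
    punct = sum(1 for ch in token if unicodedata.category(ch).startswith('P'))
    if punct > MAX_PUNCTUATION:
        return True
    # A character repeated more than MAX_REPEAT_RUN times in a row.
    if re.search(r'(.)\1{%d,}' % MAX_REPEAT_RUN, token):
        return True
    return False

def filter_tokens(tokens):
    """Drop tokens that look like OCR garbage; keep everything else."""
    return [t for t in tokens if not is_garbage(t)]

For example, filter_tokens(["the", "qu1ck", "m,;:!x--!z", "Mississippi"]) keeps "the", "qu1ck", and "Mississippi" but drops the punctuation-heavy string; a misrecognized but plausible-looking token like "qu1ck" deliberately passes, since our goal is very few false positives rather than catching every error.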