Hello all,

We have been indexing a large collection of OCR'd text: about 5 million books 
in over 200 languages.  With 1.5 billion OCR'd pages, even a small OCR error 
rate produces a large number of meaningless unique terms.  (See 
http://www.hathitrust.org/blogs/large-scale-search/too-many-words )

We would like to remove some *fraction* of these nonsense words caused by OCR 
errors prior to indexing. (We don't want to remove "real" words, so we need 
a method with very few false positives.)

A dictionary-based approach does not seem feasible given the number of 
languages and the inclusion of proper names, place names, and technical terms. 
We are considering some heuristics, such as looking for strings over a 
certain length or strings containing more than some number of punctuation 
characters.
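
For concreteness, here is a rough Java sketch of the kind of simple filter we 
have in mind. The class name, the repeated-character rule, and all of the 
thresholds are illustrative assumptions rather than anything we have tuned or 
tested:

public class GarbageTokenFilter {

    // Illustrative thresholds only; real values would need tuning per corpus.
    private static final int MAX_LENGTH = 40;      // assumed cutoff for "too long"
    private static final int MAX_PUNCTUATION = 2;  // assumed cutoff for punctuation count
    private static final int MAX_REPEAT = 3;       // assumed cutoff for a run of identical chars

    /** Returns true if the token trips any of the simple heuristics. */
    public static boolean looksLikeGarbage(String token) {
        if (token.length() > MAX_LENGTH) {
            return true;
        }
        int punctuation = 0;
        int repeat = 1;
        char prev = '\0';
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (!Character.isLetterOrDigit(c)) {
                punctuation++;
            }
            repeat = (c == prev) ? repeat + 1 : 1;
            if (punctuation > MAX_PUNCTUATION || repeat > MAX_REPEAT) {
                return true;
            }
            prev = c;
        }
        return false;
    }

    // Tiny demonstration on a few made-up tokens.
    public static void main(String[] args) {
        String[] samples = { "library", "Mississippi", "m;,i'!l.ba", "aaaaabbbb" };
        for (String s : samples) {
            System.out.println(s + " -> " + (looksLikeGarbage(s) ? "drop" : "keep"));
        }
    }
}

Something along these lines would run as a pre-processing step before the 
tokens reach the indexer, but it is only a sketch of the idea, not a tested 
implementation.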

This paper has a few such heuristics:
Kazem Taghva, Tom Nartker, Allen Condit, and Julie Borsack. Automatic Removal 
of "Garbage Strings" in OCR Text: An Implementation. In the 5th World 
Multi-Conference on Systemics, Cybernetics and Informatics, Orlando, Florida, 
July 2001. http://www.isri.unlv.edu/publications/isripub/Taghva01b.pdf

Can anyone suggest practical approaches for removing some fraction of the 
tokens containing OCR errors from our input stream?

Tom Burton-West
University of Michigan Library
www.hathitrust.org
