> Can anyone suggest any practical solutions to removing some fraction of the tokens containing OCR errors from our input stream?
One approach would be to try http://issues.apache.org/jira/browse/LUCENE-1812 and filter out terms that appear only once in the document.

--
Robert Muir
rcm...@gmail.com
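
To illustrate the "appears only once in the document" idea, here is a minimal sketch in plain Python. It is not the LUCENE-1812 patch and does not use any Lucene API; the function name `drop_singleton_terms` and the `min_freq` parameter are made up for this example, and the document is assumed to be already tokenized into a list of strings.

```python
from collections import Counter


def drop_singleton_terms(tokens, min_freq=2):
    """Keep only tokens whose frequency within this document is >= min_freq.

    Hypothetical helper for illustration; `tokens` is one document's
    token list, and order of the surviving tokens is preserved.
    """
    counts = Counter(tokens)
    return [t for t in tokens if counts[t] >= min_freq]


# Toy document in which "qu1ck" stands in for an OCR error.
doc = "the quick brown fox saw the qu1ck brown dog and the fox".split()
filtered = drop_singleton_terms(doc)
print(filtered)
```

Note that this also drops legitimate words that happen to occur once (here "quick", "saw", "dog", "and"), so it removes only some fraction of the OCR noise at the cost of some real terms.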