> Can anyone suggest any practical solutions to removing some fraction of the 
> tokens containing OCR errors from our input stream?

one approach would be to try http://issues.apache.org/jira/browse/LUCENE-1812

and filter terms that only appear once in the document.


-- 
Robert Muir
rcm...@gmail.com

Reply via email to