Re: Cleaning up dirty OCR

Robert Muir Tue, 09 Mar 2010 11:36:42 -0800

> Can anyone suggest any practical solutions to removing some fraction of the 
> tokens containing OCR errors from our input stream?


One approach would be to try http://issues.apache.org/jira/browse/LUCENE-1812
and filter out terms that appear only once in the document.
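
To make the idea concrete, here is a rough sketch of the "drop terms that occur only once in a document" filter in plain Python. It is not the LUCENE-1812 patch and does not use any Lucene API; the function name `drop_singleton_terms`, the `min_tf` parameter, and the simple `\w+` tokenizer are all assumptions made for illustration.

```python
import re
from collections import Counter


def drop_singleton_terms(text, min_tf=2):
    """Keep only tokens whose frequency within this one document is >= min_tf.

    Hypothetical helper for illustration: tokenization is a plain
    lowercase \\w+ split, not a real Lucene analyzer.
    """
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)  # per-document term frequency
    return [t for t in tokens if counts[t] >= min_tf]


# "qu1ck" stands in for an OCR error; it occurs once and is dropped,
# but so is every correctly spelled word that occurs only once.
page = "the quick brown fox and the qu1ck brown dog"
print(drop_singleton_terms(page))
```

Note the trade-off the example shows: the filter cannot tell an OCR error from a legitimate word that happens to occur once, so it removes both. On long documents such as whole books that may be acceptable; on short ones it will probably discard too much.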


-- 
Robert Muir
rcm...@gmail.com
