On Tue, Mar 9, 2010 at 2:35 PM, Robert Muir <rcm...@gmail.com> wrote:
>> Can anyone suggest any practical solutions to removing some fraction of
>> the tokens containing OCR errors from our input stream?
>
> one approach would be to try
> http://issues.apache.org/jira/browse/LUCENE-1812
>
> and filter terms that only appear once in the document.

In another life (and with another search engine) I also had to find a
solution to the dirty OCR problem. Fortunately it was only in English;
unfortunately the corpus contained many non-American/non-English names,
so we had to be very conservative to reduce the number of false
positives.

There wasn't any completely satisfactory solution. A large number of the
errors were two- and three-letter n-grams, so we were able to use a
dictionary approach to eliminate those (names tend to be longer). We
also looked for runs of punctuation, unlikely mixes of
alpha/numeric/punctuation, and eliminated longer words which consisted
of runs of not-occurring-in-English bigrams.

Hope this helps,
-Simon
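P.S. In case it's useful, here is a minimal sketch (in Python, just to
illustrate the idea) of the kind of heuristics I described: runs of
punctuation, odd alpha/numeric/punctuation mixes, a dictionary check for
very short tokens, and a never-occurring-bigram check for longer ones.
The word list and bigram list below are tiny illustrative stand-ins; in
practice you would build them from a clean corpus of your own.

```python
import re
import string

# Stand-in dictionary of common short English words; real deployments
# would use a proper word list built from clean text.
COMMON_SHORT_WORDS = {
    "a", "an", "as", "at", "be", "by", "do", "he", "if", "in", "is",
    "it", "of", "on", "or", "so", "to", "up", "we", "and", "are",
    "but", "for", "not", "the", "was", "you",
}

# Stand-in set of bigrams that essentially never occur inside English
# words; a real list would be derived from corpus bigram statistics.
RARE_BIGRAMS = {"qx", "jq", "vq", "zx", "wx", "qk", "vj", "qz"}

def is_likely_ocr_garbage(token: str) -> bool:
    """Return True if the token trips one of the dirty-OCR heuristics."""
    # 1. Runs of two or more punctuation characters (e.g. "th;;e").
    if re.search(r"[^\w\s]{2,}", token):
        return True
    # 2. Unlikely mix: letters, digits, and punctuation all in one token.
    classes = sum([
        any(c.isalpha() for c in token),
        any(c.isdigit() for c in token),
        any(c in string.punctuation for c in token),
    ])
    if classes >= 3:
        return True
    # 3. Very short alphabetic tokens must be in the dictionary
    #    (names tend to be longer, so this stays conservative).
    if len(token) <= 3 and token.isalpha() \
            and token.lower() not in COMMON_SHORT_WORDS:
        return True
    # 4. Longer words containing never-occurring-in-English bigrams.
    lower = token.lower()
    if len(token) > 3 and any(
            lower[i:i + 2] in RARE_BIGRAMS for i in range(len(lower) - 1)):
        return True
    return False
```

A filter like this can then be applied per token in an analysis chain,
dropping (or down-weighting) anything the function flags.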