On Tue, Mar 9, 2010 at 2:35 PM, Robert Muir <rcm...@gmail.com> wrote:

> > Can anyone suggest any practical solutions to removing some fraction of
> the tokens containing OCR errors from our input stream?
>
> one approach would be to try
> http://issues.apache.org/jira/browse/LUCENE-1812
>
> and filter terms that only appear once in the document.
>
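A minimal sketch of that idea in plain Java (not the LUCENE-1812 patch
itself, just the underlying logic): drop any term that occurs only once in
a document, on the theory that dirty OCR rarely produces the same garbage
string twice in the same document.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Two passes over one document's tokens: count term frequencies,
    // then keep only the terms that repeat.
    public class HapaxFilter {
        public static List<String> dropSingletons(List<String> tokens) {
            Map<String, Integer> counts = new HashMap<String, Integer>();
            for (String t : tokens) {
                Integer c = counts.get(t);
                counts.put(t, c == null ? 1 : c + 1);
            }
            List<String> kept = new ArrayList<String>();
            for (String t : tokens) {
                if (counts.get(t) > 1) {
                    kept.add(t);
                }
            }
            return kept;
        }
    }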

In another life (and with another search engine) I also had to find a
solution to the dirty OCR problem. Fortunately the corpus was only in
English; unfortunately it contained many non-American/non-English names,
so we had to be very conservative to keep the false-positive rate down.

There wasn't any completely satisfactory solution. A large number of the
errors were two- and three-letter fragments, so we were able to use a
dictionary approach to eliminate those (names tend to be longer). We also
looked for runs of punctuation and unlikely mixes of
alpha/numeric/punctuation, and eliminated longer words that consisted of
runs of bigrams that do not occur in English.
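
A rough sketch of those heuristics in Java; the short-word dictionary, the
English-bigram set, and the thresholds are placeholders you would need to
build and tune from a clean reference corpus, not what we actually shipped.

    import java.util.Set;
    import java.util.regex.Pattern;

    // Heuristic token rejection along the lines described above.
    // shortWords: known 2-3 letter English words.
    // englishBigrams: letter bigrams observed in clean English text.
    public class OcrNoiseHeuristics {
        private static final Pattern PUNCT_RUN =
            Pattern.compile("\\p{Punct}{3,}");
        private static final Pattern ALL_LETTERS =
            Pattern.compile("\\p{Alpha}+");

        private final Set<String> shortWords;
        private final Set<String> englishBigrams;

        public OcrNoiseHeuristics(Set<String> shortWords,
                                  Set<String> englishBigrams) {
            this.shortWords = shortWords;
            this.englishBigrams = englishBigrams;
        }

        public boolean looksLikeOcrError(String token) {
            String t = token.toLowerCase();
            // very short tokens must be dictionary words
            // (names tend to be longer)
            if (t.length() <= 3) {
                return !shortWords.contains(t);
            }
            // runs of punctuation
            if (PUNCT_RUN.matcher(t).find()) {
                return true;
            }
            // unlikely mix: letters, digits, and punctuation in one token
            boolean hasAlpha = t.matches(".*\\p{Alpha}.*");
            boolean hasDigit = t.matches(".*\\p{Digit}.*");
            boolean hasPunct = t.matches(".*\\p{Punct}.*");
            if (hasAlpha && hasDigit && hasPunct) {
                return true;
            }
            // longer all-letter words: reject only on a run of
            // bigrams never seen in English text
            if (ALL_LETTERS.matcher(t).matches()) {
                int consecutiveRare = 0;
                for (int i = 0; i + 1 < t.length(); i++) {
                    if (!englishBigrams.contains(t.substring(i, i + 2))) {
                        consecutiveRare++;
                        if (consecutiveRare >= 2) {
                            return true;
                        }
                    } else {
                        consecutiveRare = 0;
                    }
                }
            }
            return false;
        }
    }

Demanding a run of unseen bigrams, rather than a single one, is what keeps
the check conservative around unusual names.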

Hope this helps

-Simon
