Re: Cleaning up dirty OCR

Chris Hostetter Thu, 11 Mar 2010 13:35:20 -0800

: We can probably implement your suggestion about runs of punctuation and
: unlikely mixes of alpha/numeric/punctuation.  I'm also thinking about
: looking for unlikely mixes of unicode character blocks.  For example some of
: the CJK material ends up with Cyrillic characters. (except we would have to
: watch out for any Russian-Chinese dictionaries:)


Since you are dealing with multiple languages, and multiple variant usages 
of languages (ie: olde english), I wonder if one way to generalize 
the idea of "unlikely" letter combinations into a math problem (instead of 
a grammar/spelling problem) would be to score all the hapax legomenon 
words in your index based on the frequency of the (character) N-grams in 
each of those words, relative to the entire corpus, and then eliminate any of 
the hapax legomenon words whose score is below some cutoff threshold 
(that you'd have to pick arbitrarily, probably by eyeballing the sorted 
list of words and their contexts to decide if they are legitimate)

        ?
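
To make the scoring idea concrete, here is a rough sketch in Python. It is 
only one possible reading of the suggestion: the function names, the choice 
of trigrams, the "^"/"$" boundary markers, and the mean-log-frequency score 
are all assumptions made for illustration, and the input is assumed to be a 
plain dict of term -> corpus frequency dumped from the index by whatever 
means you have.

```python
import math
from collections import Counter


def char_ngrams(word, n=3):
    """Return the character n-grams of a word, with boundary markers added."""
    padded = "^" + word + "$"  # hypothetical start/end markers
    if len(padded) < n:
        return [padded]
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def score_hapaxes(term_freqs, n=3):
    """Score every term that occurs exactly once (a hapax legomenon).

    term_freqs maps each term to its frequency in the corpus. The score is
    the mean log relative frequency of the term's character n-grams, where
    n-gram counts are weighted by term frequency across the whole corpus.
    Lower scores mean more "unlikely" letter combinations.
    """
    ngram_counts = Counter()
    for term, freq in term_freqs.items():
        for gram in char_ngrams(term, n):
            ngram_counts[gram] += freq
    total = sum(ngram_counts.values())

    scores = {}
    for term, freq in term_freqs.items():
        if freq != 1:
            continue
        grams = char_ngrams(term, n)
        # Every n-gram of a hapax was counted at least once above,
        # so the log never sees zero.
        scores[term] = sum(
            math.log(ngram_counts[g] / total) for g in grams
        ) / len(grams)
    return scores


def suspects_below(scores, cutoff):
    """Return hapaxes scoring below the cutoff, least likely first."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1])
    return [term for term, score in ranked if score < cutoff]
```

The cutoff passed to suspects_below() is exactly the arbitrary threshold 
mentioned above: print the sorted scores first, eyeball where the garbage 
stops and the legitimate rare words start, and pick a value from there.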


-Hoss
