Re: Cleaning up dirty OCR

Walter Underwood Thu, 11 Mar 2010 13:50:25 -0800

On Mar 11, 2010, at 1:34 PM, Chris Hostetter wrote:

> I wonder if one way to try and generalize 
> the idea of "unlikely" letter combinations into a math problem (instead of 
> a grammar/spelling problem) would be to score all the hapax legomenon 
> words in your index



Hmm, how about a classifier? Common words are the "yes" training set, hapax 
legomena are the "no" set, and character n-grams are the features.
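
Something like the following, as a rough sketch only. It assumes a naive Bayes
model over character trigrams with add-one smoothing, which is just one
reasonable choice of classifier. The class and function names are made up for
illustration, and the two tiny word lists are invented stand-ins for what you
would really pull from the index (high-frequency terms and terms with a count
of one).

```python
import math
from collections import Counter


def char_ngrams(word, n=3):
    """Return the character n-grams of a word, with ^ and $ as boundary marks."""
    padded = "^" + word.lower() + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Hypothetical two-class naive Bayes over character n-grams."""

    def __init__(self, n=3):
        self.n = n
        self.counts = {"yes": Counter(), "no": Counter()}  # n-gram counts per class
        self.totals = {"yes": 0, "no": 0}                  # total n-grams per class
        self.docs = {"yes": 0, "no": 0}                    # training words per class
        self.vocab = set()                                 # all n-grams seen

    def train(self, words, label):
        for word in words:
            grams = char_ngrams(word, self.n)
            self.counts[label].update(grams)
            self.totals[label] += len(grams)
            self.docs[label] += 1
            self.vocab.update(grams)

    def log_odds(self, word):
        """Log of P(yes | word) / P(no | word), using add-one smoothing."""
        v = len(self.vocab)
        score = math.log(self.docs["yes"] / self.docs["no"])
        for g in char_ngrams(word, self.n):
            p_yes = (self.counts["yes"][g] + 1) / (self.totals["yes"] + v)
            p_no = (self.counts["no"][g] + 1) / (self.totals["no"] + v)
            score += math.log(p_yes / p_no)
        return score

    def looks_like_word(self, word):
        return self.log_odds(word) > 0


# Toy training data (invented): "yes" = common words, "no" = hapax legomena,
# here simulated with typical OCR confusions such as h -> li and h -> b.
common = ["the", "and", "that", "there", "other", "then", "than", "with"]
hapax = ["tlie", "arid", "tbat", "tliere", "otlier", "tben", "tban", "witli"]

clf = NgramNaiveBayes(n=3)
clf.train(common, "yes")
clf.train(hapax, "no")

# Score two words that were not in either training list.
for w in ["these", "tliese"]:
    print(w, clf.looks_like_word(w), round(clf.log_odds(w), 2))
```

On a real index the "no" set will also contain legitimate rare words, so the
log-odds score is probably more useful as a ranking than as a hard cutoff.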

But why isn't the OCR program already doing this?

wunder
