We've been thinking about running some kind of classifier against each book
to select books with a high percentage of dirty OCR for special
processing.  We haven't quite figured out a multilingual feature set yet,
other than the punctuation/alphanumeric ratio and character-block ideas
mentioned above.
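As a rough illustration of what those two feature ideas might look like (this is a hypothetical sketch, not code we have running; the feature names and the use of Unicode character names as a proxy for character blocks are my own assumptions):

```python
import re
import unicodedata

def ocr_noise_features(text):
    """Crude, language-agnostic features for spotting dirty OCR.

    Illustrative only: thresholds and feature names are made up here,
    not taken from any existing tool.
    """
    if not text:
        return {"punct_ratio": 0.0, "mixed_block_ratio": 0.0}
    # Fraction of characters that are neither alphanumeric nor whitespace;
    # dirty OCR tends to be punctuation-heavy.
    punct = sum(1 for c in text if not c.isalnum() and not c.isspace())
    punct_ratio = punct / len(text)
    # Fraction of tokens mixing more than one script, e.g. Latin letters
    # fused with Cyrillic look-alikes by the OCR engine.  The first word
    # of the Unicode character name is used as a cheap script proxy.
    tokens = re.findall(r"\S+", text)
    def scripts(tok):
        return {unicodedata.name(c, "?").split()[0] for c in tok if c.isalpha()}
    mixed = sum(1 for t in tokens if len(scripts(t)) > 1)
    mixed_block_ratio = mixed / max(len(tokens), 1)
    return {"punct_ratio": punct_ratio, "mixed_block_ratio": mixed_block_ratio}
```

Books whose pages score high on either ratio would be the candidates for special processing.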

I'm not sure I understand your suggestion. Since real-word hapax legomena
are generally quite common (maybe 40-60% of the unique word types in a clean
text), wouldn't using them as the "no" set send mixed signals to the
classifier?
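For concreteness, the classifier being discussed might look something like this toy character n-gram Naive Bayes (purely illustrative; all names here are mine, and a real system would use a proper ML library and far more training data):

```python
import math
from collections import Counter

def char_ngrams(word, n=3):
    """Character n-grams with boundary markers, e.g. 'the' -> ^th, the, he$."""
    w = f"^{word}$"
    return [w[i:i + n] for i in range(len(w) - n + 1)]

class NgramWordClassifier:
    """Toy Naive Bayes over character n-grams: common words form the
    "yes" set, suspected OCR garbage the "no" set.  Illustrative sketch."""

    def __init__(self, n=3):
        self.n = n
        self.counts = {"yes": Counter(), "no": Counter()}
        self.totals = {"yes": 0, "no": 0}

    def train(self, label, words):
        for w in words:
            grams = char_ngrams(w, self.n)
            self.counts[label].update(grams)
            self.totals[label] += len(grams)

    def score(self, word):
        # Log-odds that the word is real ("yes") vs garbage ("no"),
        # with add-one (Laplace) smoothing over the combined gram vocabulary.
        vocab = len(self.counts["yes"] | self.counts["no"]) or 1
        logodds = 0.0
        for g in char_ngrams(word, self.n):
            p_yes = (self.counts["yes"][g] + 1) / (self.totals["yes"] + vocab)
            p_no = (self.counts["no"][g] + 1) / (self.totals["no"] + vocab)
            logodds += math.log(p_yes / p_no)
        return logodds
```

The worry above, in these terms: if hapax legomena go into the "no" training set, many perfectly good real-word n-grams get counted as garbage evidence, dragging down the score of legitimate rare words.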

Tom


Walter Underwood-2 wrote:
> 
> 
> Hmm, how about a classifier? Common words are the "yes" training set,
> hapax legomenons are the "no" set, and n-grams are the features.
> 
> But why isn't the OCR program already doing this?
> 
> wunder

-- 
View this message in context: 
http://old.nabble.com/Cleaning-up-dirty-OCR-tp27840753p27871444.html
Sent from the Solr - User mailing list archive at Nabble.com.
