Thanks Simon,

We can probably implement your suggestion about runs of punctuation and unlikely mixes of alpha/numeric/punctuation. I'm also thinking about looking for unlikely mixes of Unicode character blocks. For example, some of the CJK material ends up with Cyrillic characters (although we would have to watch out for any Russian-Chinese dictionaries :).
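
A rough sketch of the two checks mentioned above, in Python, just to make the idea concrete. Everything here is an assumption for illustration: the helper names, the three-character threshold for a punctuation run, the single "unlikely" pair (CJK plus Cyrillic), and the shortcut of using the first word of the Unicode character name as a stand-in for the character block. It is not what either of us actually runs.

```python
import re
import unicodedata

# Hypothetical threshold: three or more consecutive characters that are
# neither word characters nor whitespace count as a "run of punctuation".
PUNCT_RUN = re.compile(r"[^\w\s]{3,}")

# Hypothetical list of script pairs that should not share a single token.
UNLIKELY_PAIRS = {frozenset({"CJK", "CYRILLIC"})}


def scripts_in(token):
    """Return coarse script labels for the alphabetic characters in token.

    The first word of the Unicode character name (e.g. "CJK", "CYRILLIC",
    "LATIN") is used as an approximation of the block; this is a shortcut,
    not a real block or script lookup.
    """
    scripts = set()
    for ch in token:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        scripts.add(name.split(" ")[0] if name else "UNKNOWN")
    return scripts


def has_unlikely_script_mix(token):
    """True if the token contains every script of some unlikely pair."""
    found = scripts_in(token)
    return any(pair <= found for pair in UNLIKELY_PAIRS)


def has_punctuation_run(token):
    """True if the token contains a run of punctuation-like characters."""
    return PUNCT_RUN.search(token) is not None


def looks_like_ocr_noise(token):
    """Combine both heuristics for a single token."""
    return has_punctuation_run(token) or has_unlikely_script_mix(token)
```

Because the script check works per token, a Russian-Chinese dictionary would only be flagged where both scripts land in the same token, so how the text is tokenized matters a lot here.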

Tom

> There wasn't any completely satisfactory solution. There were a large number
> of two- and three-letter n-grams, so we were able to use a dictionary approach
> to eliminate those (names tend to be longer). We also looked for runs of
> punctuation and unlikely mixes of alpha/numeric/punctuation, and eliminated
> longer words which consisted of runs of not-occurring-in-English bigrams.
>
> Hope this helps
>
> -Simon

--
View this message in context: http://old.nabble.com/Cleaning-up-dirty-OCR-tp27840753p27869940.html
Sent from the Solr - User mailing list archive at Nabble.com.