Thanks Simon,

We can probably implement your suggestion about runs of punctuation and
unlikely mixes of alpha/numeric/punctuation.  I'm also thinking about
looking for unlikely mixes of Unicode character blocks.  For example, some of
the CJK material ends up with Cyrillic characters, although we would have to
watch out for any Russian-Chinese dictionaries :)
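
Here is a rough sketch of how those three checks might look in Python. None
of this is tested against our data. The function names, the three-character
punctuation threshold, and the allowed_mixes whitelist are placeholders I made
up for illustration. Python's standard library does not expose the Unicode
block or script of a character directly, so the sketch approximates it with
the first word of the character's Unicode name (CJK, CYRILLIC, LATIN, ...).

```python
import re
import unicodedata

# Three or more consecutive non-word, non-space characters.
# The threshold of 3 is an illustrative assumption, not a tuned value.
PUNCT_RUN = re.compile(r"[^\w\s]{3,}")


def rough_script(ch):
    """Approximate a letter's script by the first word of its Unicode name."""
    if not ch.isalpha():
        return None
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else None


def scripts_in(token):
    """Set of approximate scripts used by the letters in a token."""
    return {s for s in map(rough_script, token) if s}


def has_punct_run(token):
    return PUNCT_RUN.search(token) is not None


def mixes_classes(token):
    """True if the token contains letters, digits, and punctuation together."""
    has_alpha = any(c.isalpha() for c in token)
    has_digit = any(c.isdigit() for c in token)
    has_punct = any(not c.isalnum() and not c.isspace() for c in token)
    return has_alpha and has_digit and has_punct


def looks_dirty(token, allowed_mixes=()):
    """Flag a token as probable OCR noise.

    allowed_mixes is a collection of frozensets of script names that are
    legitimate together, e.g. for a Russian-Chinese dictionary.
    """
    scripts = scripts_in(token)
    if len(scripts) > 1 and frozenset(scripts) not in allowed_mixes:
        return True
    return has_punct_run(token) or mixes_classes(token)
```

A token such as "中д文" would be flagged unless
frozenset({"CJK", "CYRILLIC"}) is passed in allowed_mixes. Legitimate
Japanese text mixes CJK with HIRAGANA and KATAKANA, so those combinations
would need whitelisting too, and the letters/digits/punctuation check is
crude enough to flag real terms like "x86-64".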

Tom



> There wasn't any completely satisfactory solution. There were a large
> number of two- and three-letter n-grams, so we were able to use a
> dictionary approach to eliminate those (names tend to be longer). We also
> looked for runs of punctuation and unlikely mixes of
> alpha/numeric/punctuation, and eliminated longer words which consisted of
> runs of bigrams that do not occur in English.
> 
> Hope this helps
> 
> -Simon

-- 
View this message in context: 
http://old.nabble.com/Cleaning-up-dirty-OCR-tp27840753p27869940.html
Sent from the Solr - User mailing list archive at Nabble.com.
