: We can probably implement your suggestion about runs of punctuation and
: unlikely mixes of alpha/numeric/punctuation. I'm also thinking about
: looking for unlikely mixes of unicode character blocks. For example some of
: the CJK material ends up with Cyrillic characters. (except we would have to
: watch out for any Russian-Chinese dictionaries:)

Since you are dealing with multiple languages, and multiple variant usages of languages (e.g. olde English), I wonder if one way to generalize the idea of "unlikely" letter combinations into a math problem (instead of a grammar/spelling problem) would be to score every hapax legomenon in your index based on the frequency of the (character) N-grams in that word, relative to the entire corpus. You could then eliminate any hapax legomena whose score falls below some cutoff threshold (which you'd have to pick arbitrarily, probably by eyeballing the sorted list of words and their contexts to decide if they are legitimate)?

-Hoss
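
A rough sketch of what that scoring might look like, in Python. Everything here is an assumption for illustration: the input is a plain dict of term -> corpus frequency (however you export that from your index), the N-gram size is 3, and the score is the mean log relative frequency of a word's N-grams. The names char_ngrams and score_hapaxes are made up for this example, and other scoring functions would work just as well.

```python
from collections import Counter
import math


def char_ngrams(word, n=3):
    """Return the character n-grams of a word, with boundary markers added."""
    padded = "^" + word + "$"  # mark word start/end so edges count too
    if len(padded) < n:
        return [padded]
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def score_hapaxes(term_counts, n=3):
    """Score each term that occurs exactly once by how common its n-grams are.

    term_counts: dict mapping term -> number of occurrences in the corpus
    (a hypothetical export from the index). Lower scores mean the word is
    built from rarer character sequences, so it is more likely to be junk.
    """
    # Count n-grams over the whole corpus, weighted by term frequency.
    ngram_counts = Counter()
    for word, count in term_counts.items():
        for gram in char_ngrams(word, n):
            ngram_counts[gram] += count
    total = sum(ngram_counts.values())

    scores = {}
    for word, count in term_counts.items():
        if count != 1:
            continue  # only hapax legomena are candidates for removal
        grams = char_ngrams(word, n)
        # Mean log relative frequency, so long words are not penalized
        # just for containing more n-grams.
        scores[word] = sum(
            math.log(ngram_counts[g] / total) for g in grams
        ) / len(grams)
    return scores


# Toy corpus: one plausible hapax and one that looks like OCR noise.
term_counts = {"the": 50, "then": 20, "there": 10, "other": 5,
               "thene": 1, "x#q9z": 1}
scores = score_hapaxes(term_counts)
ranked = sorted(scores, key=scores.get)  # most suspicious first
```

Sorting by score gives you the list to eyeball when picking the cutoff: the junk should cluster at the low end, and the threshold goes wherever legitimate words start to dominate.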