: Interesting.  I wonder though if we have 4 million English documents and 250
: in Urdu, if the Urdu words would score badly when compared to ngram
: statistics for the entire corpus.
Well it doesn't have to be a strict ratio cutoff .. you could look at the
average frequency of all character Ngrams in your index, then consider any
Ngram whose frequency falls more than X stddevs below that average to be
suspicious, and eliminate any word that contains Y or more suspicious
Ngrams.  Or you could just start really simple and eliminate any word that
contains an Ngram that doesn't appear in *any* other word in your corpus.

I don't deal with a lot of multi-lingual stuff, but my understanding is
that this sort of thing gets a lot easier if you can partition your docs by
language -- and even if you can't, you could do some language detection on
the (dirty) OCRed text to get a language guess, then partition by language
and look for the suspicious words within each partition.
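A rough sketch of the stddev idea, in Python.  All of the names here are
hypothetical (ngram_counts stands in for whatever character-Ngram
frequencies you can pull out of your index), and the thresholds are just
placeholders for the X and Y knobs above:

from statistics import mean, stdev

def suspicious_ngrams(ngram_counts, x_stddevs=2.0):
    # ngram_counts: dict mapping each character Ngram to its corpus
    # frequency (however you extract term stats from your index).
    freqs = list(ngram_counts.values())
    avg, sd = mean(freqs), stdev(freqs)
    # an Ngram is "suspicious" if its frequency falls more than
    # x_stddevs below the corpus-wide average
    return {g for g, c in ngram_counts.items() if c < avg - x_stddevs * sd}

def is_garbage_word(word, suspicious, n=3, y=2):
    # flag a word that contains y or more suspicious Ngrams
    grams = (word[i:i + n] for i in range(len(word) - n + 1))
    return sum(1 for g in grams if g in suspicious) >= y

The "really simple" variant is just is_garbage_word with y=1 and a
suspicious set containing only the Ngrams that occur in a single word.

-Hoss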