: Interesting.  I wonder though if we have 4 million English documents and 250
: in Urdu, if the Urdu words would score badly when compared to ngram
: statistics for the entire corpus.  

Well it doesn't have to be a strict ratio cutoff .. you could look at the 
average frequency of all character Ngrams in your index, and then 
consider any Ngram whose frequency is more than X stddevs below the 
average to be suspicious, and eliminate any word that contains Y or more 
suspicious Ngrams.
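
Something like this rough (untested) python sketch of the idea -- in 
practice you'd pull the Ngram counts from your actual index instead of 
recomputing them, and the ngram size / X / Y values are just placeholders:

    from collections import Counter
    from statistics import mean, stdev

    def char_ngrams(word, n=3):
        # all character ngrams of length n in a word
        return [word[i:i+n] for i in range(len(word) - n + 1)]

    def suspicious_words(words, n=3, x=2.0, y=2):
        vocab = set(words)
        # count how many distinct words each ngram appears in
        ngram_freq = Counter()
        for w in vocab:
            ngram_freq.update(set(char_ngrams(w, n)))
        avg = mean(ngram_freq.values())
        sd = stdev(ngram_freq.values())
        # an ngram is suspicious if its frequency is more than
        # x stddevs below the average frequency
        rare = {g for g, c in ngram_freq.items() if c < avg - x * sd}
        # a word is suspicious if it contains y or more suspicious ngrams
        return [w for w in vocab
                if sum(1 for g in set(char_ngrams(w, n)) if g in rare) >= y]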

Or you could just start really simple and eliminate any word that contains 
an Ngram that doesn't appear in *any* other word in your corpus.
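
Quick sketch of that version (reusing char_ngrams and Counter from the 
sketch above):

    def lonely_ngram_words(words, n=3):
        vocab = set(words)
        ngram_freq = Counter()
        for w in vocab:
            ngram_freq.update(set(char_ngrams(w, n)))
        # flag any word containing an ngram that shows up in no other word
        return [w for w in vocab
                if any(ngram_freq[g] == 1 for g in set(char_ngrams(w, n)))]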

I don't deal with a lot of multi-lingual stuff, but my understanding is 
that this sort of thing gets a lot easier if you can partition your docs 
by language -- and even if you can't, you can do some language detection 
on the (dirty) OCRed text to get a language guess, then partition by 
language and look for the suspicious words in each partition.
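
e.g. something like this (sketch using the python langdetect package 
purely as an illustration -- any language guesser would do, and guesses 
on dirty OCR output will be noisy):

    from collections import defaultdict
    from langdetect import detect, LangDetectException

    def partition_by_language(docs):
        # docs is an iterable of (doc_id, text) pairs;
        # returns {language guess: [doc_ids]}
        buckets = defaultdict(list)
        for doc_id, text in docs:
            try:
                lang = detect(text)        # e.g. 'en', 'ur'
            except LangDetectException:
                lang = 'unknown'
            buckets[lang].append(doc_id)
        return buckets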


-Hoss
