Sunday, December 8, 2002, 10:20:01 PM, Gary Miller wrote:

[re: algorithm to determine language in 20 bytes or less]
GM> I wanted to use this in my bot so that if a user types in French, German,
GM> etc., my bot could say "sorry, I don't speak French" instead of resorting
GM> to a bluff.

As a quick and dirty method for checking language, counting trigram
frequencies might work.  A trigram is a specific sequence of three
letters; just as "e" is the most common letter in English, certain
trigrams occur more often than others, and the specific distribution
varies from language to language.  E.g., "cht" is more common in German
(than in English), "cce" in Italian, and "eau" in French.  Some public
domain dictionaries, a few probability formulas, and you're on your way.
Google around a bit for "trigram frequencies" and the like; it's often
used in cryptography.  http://web.mit.edu/craighea/www/ldetect/ might help.
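For what it's worth, here's a rough sketch of the idea in Python.  The
language names and corpus files in the usage comment are just
placeholders; you'd build the profiles from real text in each language:

from collections import Counter

def trigrams(text):
    """Extract letter trigrams from text: lowercase, letters only."""
    letters = ''.join(c for c in text.lower() if c.isalpha())
    return [letters[i:i+3] for i in range(len(letters) - 2)]

def trigram_profile(corpus_text, top_n=300):
    """Map the most common trigrams in a corpus to relative frequencies."""
    counts = Counter(trigrams(corpus_text))
    total = sum(counts.values()) or 1
    return {t: n / total for t, n in counts.most_common(top_n)}

def guess_language(text, profiles):
    """Score the input against each language profile; highest score wins.
    'profiles' maps a language name to a profile from trigram_profile()."""
    scores = {}
    for lang, profile in profiles.items():
        scores[lang] = sum(profile.get(t, 0.0) for t in trigrams(text))
    return max(scores, key=scores.get) if scores else None

# Hypothetical usage -- the corpus files stand in for whatever public
# domain text you train on:
# profiles = {
#     "english": trigram_profile(open("english_corpus.txt").read()),
#     "german":  trigram_profile(open("german_corpus.txt").read()),
#     "french":  trigram_profile(open("french_corpus.txt").read()),
# }
# print(guess_language("je ne parle pas anglais", profiles))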
 
GM> I also do a spell check in my bot against a dictionary sorted by the
GM> usage frequency of the words and get a list of possible replacement
GM> words based on Levenshtein distance.  Soundex is bad if the first
GM> character of the word is wrong or if the user transposes two letters
GM> that give the word a different phonetic sound.  For each possible
GM> correction I have to resubmit the potentially corrected input back
GM> through my pattern matcher.  To minimize response time and improve
GM> scalability it would be optimal to know the probability of a word
GM> occurring given that the previous word or words are correct.
GM> Then I could sort the replacement words by the probability they occur
GM> after the prior correctly spelled word.  This would allow me to get the
GM> correct word most of the time on the first or second try instead of the
GM> several tries it takes me now.

I'm sure this problem has been tackled before (maybe in handwriting or
speech recognition?).  You might try something similar to trigrams or
bigrams, but word-based instead of letter-based: count how often each
word follows another in a large corpus and rank your candidate
corrections by that.  There's plenty of public domain text (ebooks,
Usenet, web pages) to build statistics from.  The database might end up
kinda large, though.
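Something quick along those lines in Python, assuming the candidate list
already comes out of your Levenshtein spell checker -- the corpus
filename and example words below are made up:

from collections import Counter, defaultdict

def word_bigram_model(corpus_text):
    """Count how often each word follows each other word in the corpus."""
    words = corpus_text.lower().split()
    following = defaultdict(Counter)
    for prev, cur in zip(words, words[1:]):
        following[prev][cur] += 1
    return following

def rank_candidates(prev_word, candidates, following):
    """Order spelling-correction candidates by how often they follow
    prev_word in the corpus; unseen candidates sort last."""
    counts = following.get(prev_word.lower(), Counter())
    return sorted(candidates, key=lambda w: counts[w.lower()], reverse=True)

# Hypothetical usage -- "big_corpus.txt" is whatever text you train on,
# and the candidates would come from the edit-distance spell check:
# following = word_bigram_model(open("big_corpus.txt").read())
# print(rank_candidates("don't", ["speak", "spear", "spook"], following))

You'd try the top-ranked candidate through the pattern matcher first and
only fall back to the rest if it misses, which is roughly the behavior
GM is after.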

--
Cliff
