Sunday, December 8, 2002, 10:20:01 PM, Gary Miller wrote:

[re: algorithm to determine language in 20 bytes or less]

GM> I wanted to use this in my bot so that if a user types in French, German,
GM> etc., my bot could say "sorry, I don't speak French" instead of resorting
GM> to a bluff.

As a quick and dirty method for checking language, counting trigram frequencies might work. A trigram is a specific sequence of three letters. Just as "e" is the most common letter in English, certain trigrams occur more often than others, and the specific distribution varies from language to language. E.g., "cht" is more common in German than in English, "cce" in Italian and "eau" in French. Some public domain dictionaries, a few probability formulas and you're on your way. Google around a bit for "trigram frequencies" and the like; it's often used in cryptography. http://web.mit.edu/craighea/www/ldetect/ might help.

GM> I also do a spell check in my bot against a dictionary sorted by the
GM> usage frequency of the words and get a list of possible replacement
GM> words based on Levenshtein Distance. Soundex is bad if the first
GM> character of the word is wrong or if the user transposes two letters
GM> that give the word a different phonetic sound. For each possible
GM> correction I have to resubmit the potentially corrected input back
GM> through my pattern matcher. To minimize response time and improve
GM> scalability it would be optimal to know what the probability of a word
GM> occurring was if you knew the previous word or words were correct.
GM> Then I could sort the replacement words by the probability they occur
GM> after the prior correctly spelled word. This would allow me to get the
GM> correct word most of the time on the first or second try instead of the
GM> several tries it takes me now.

I'm sure this problem has been tackled before (maybe in handwriting or speech recognition?). You might try something similar to trigrams or digrams, but word-based instead of letter-based: count how often each word follows each other word, then try the replacement candidates in order of how often they followed the previous word. There's plenty of public domain text (ebooks, Usenet, web pages) to build statistics from. The database might end up kinda large, though.
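
To make the word-based idea a bit more concrete, here is a rough Python sketch of re-ranking spelling candidates by word-pair counts. It is only an illustration, not how Gary's bot works: the function names (build_bigram_counts, rank_candidates) and the one-line corpus are made up for the example, and a real table would be built from a large body of text.

```python
from collections import Counter, defaultdict

def build_bigram_counts(text):
    """Count how often each word follows each other word in a training text."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for prev, cur in zip(words, words[1:]):
        counts[prev][cur] += 1
    return counts

def rank_candidates(prev_word, candidates, counts):
    """Sort candidate corrections by how often they followed prev_word."""
    following = counts.get(prev_word.lower(), Counter())
    # sorted() is stable, so candidates with equal counts keep the caller's
    # original order (e.g. the existing edit-distance or usage-frequency order).
    return sorted(candidates, key=lambda w: -following[w.lower()])

# Toy corpus, purely illustrative.
corpus = "the cat sat on the mat and the cat ate the fish"
counts = build_bigram_counts(corpus)
print(rank_candidates("the", ["mat", "cat", "hat"], counts))
# ['cat', 'mat', 'hat']
```

Since ties keep their incoming order, the list can be sorted by Levenshtein distance or usage frequency first and the pair counts only reshuffle candidates for which there is evidence.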
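
For the language check itself, the trigram approach could be sketched along these lines, again in Python. Everything here is an assumption for illustration: the three sample sentences stand in for real per-language corpora, the floor probability for unseen trigrams is an arbitrary pick, and the names (trigram_profile, score, guess_language) are hypothetical.

```python
import math
from collections import Counter

def clean(text):
    # Keep letters and spaces only, so punctuation and digits add no noise.
    return "".join(c for c in text.lower() if c.isalpha() or c == " ")

def trigrams(text):
    s = clean(text)
    return [s[i:i + 3] for i in range(len(s) - 2)]

def trigram_profile(text):
    """Relative frequency of each letter trigram in a training text."""
    grams = Counter(trigrams(text))
    total = sum(grams.values())
    return {g: n / total for g, n in grams.items()}

def score(text, profile, floor=1e-6):
    """Sum of log-probabilities of the text's trigrams under a profile.
    Trigrams never seen in training get the small floor probability."""
    return sum(math.log(profile.get(g, floor)) for g in trigrams(text))

def guess_language(text, profiles):
    """Return the language whose profile gives the text the highest score."""
    return max(profiles, key=lambda lang: score(text, profiles[lang]))

# Tiny placeholder samples; real profiles need far more text per language.
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and then "
               "the dog sleeps in the sun",
    "german": "ich spreche nicht so gut deutsch aber ich möchte es gerne "
              "lernen und nicht vergessen",
    "french": "je ne parle pas français mais je voudrais bien apprendre "
              "la langue avec vous",
}
profiles = {lang: trigram_profile(txt) for lang, txt in SAMPLES.items()}

print(guess_language("sprechen sie deutsch", profiles))
# german
```

With profiles this small the guess is only reliable for input that overlaps the samples. For a bot, it is probably also worth requiring some minimum gap between the best and second-best score before saying "sorry, I don't speak French".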

-- Cliff