Chris Croughton wrote: > On Thu, 19 May 2005 06:38:59 +1000, John Machin > <[EMAIL PROTECTED]> wrote: > > >>On Wed, 18 May 2005 15:06:53 -0500, Ed Morton <[EMAIL PROTECTED]> >>wrote: >> >> >>>William Park wrote: >>> >>> >>>>How do you compare 2 strings, and determine how much they are "close" to >>>>each other? Eg. >>>> aqwerty >>>> qwertyb >>>>are similar to each other, except for first/last char. But, how do I >>>>quantify that? >>> >>>"However you like" is probably the right answer, but one way might be to >>>compare their soundex encoding >>>(http://foldoc.doc.ic.ac.uk/foldoc/foldoc.cgi?soundex) and figure out >>>percentage difference based on comparing the numeric part. >> >>Fantastic suggestion. Here's a tiny piece of real-life test data: >> >>compare the surnames "Mousaferiadis" and "McPherson". > > > I remember a word processor, MultiMate, which used Soundex to do > matching for its suggestions for spelling correction. One of my > cow-orkers typed the word 'agains' -- it was supposed to be 'against' > but 'again' would also have been a sensible suggestion. MultiMate, > however, suggested neither of those reasonable words, it did suggest > "iguanas" amd "Utahns"... > > (I wonder what it does with "Talliafero" and "Tolliver", and with > "Featherstone-Howe" and "Fanshaw"...) > > The answer to the OP, which someone just pointed out to me on > comp.programming, is "edit distance" or "Levenshtein Distance"[1]. A > google search on either produces some good descriptions in the first few > links, including http://www.merriampark.com/ld.htm which has not only a > description of the algorithm but also source code in Java, C++ and > Visual Basic (no Awk, thought there are links to pages with > implementations in Perl, Python, Delphi and many more)... > > [1] I would have spelt it 'Levenstein', and pronounced it 'Levenshtein' > in Schwaebisch (south German) fashion, but apparently the author of the > algorithm was one Vladimir I. Levenshtein and that is how he is credited > on the IEEE site. There are also a number of Google hits under the > 'stein' spelling, though... > > Chris C
And what's the Levenshtein distance between "levenshtein" and "levenstein"? Ironic if it were large ... regards Steve -- Steve Holden +1 703 861 4237 +1 800 494 3119 Holden Web LLC http://www.holdenweb.com/ Python Web Programming http://pydish.holdenweb.com/ -- http://mail.python.org/mailman/listinfo/python-list