Jeff Schwab wrote: > Tim Chase wrote: >>> Are there any Python libraries implementing measurement of similarity >>> of two strings of Latin characters? >> It sounds like you're interested in calculating the Levenshtein distance: >> >> http://en.wikipedia.org/wiki/Levenshtein_distance >> >> which gives you a measure of how different they are. A measure of "0" >> is that the inputs are the same. The more different the two strings >> are, the greater the resulting output of the function. >> >> Unfortunately, it's an O(MN) algorithm (where M=len(word1) and >> N=len(word2)) from my understanding of the code I've seen. However it >> really is the best approximation I've seen of a "how similar are these >> two strings" function. Googling for >> >> python levenshtein distance >> >> brings up oodles of hits. > > If the strings happen to be the same length, the Levenshtein distance is > equivalent to the Hamming distance. The Wikipedia article gives the > following Python implementation: > > # http://en.wikipedia.org/wiki/Hamming_distance > def hamdist(s1, s2): > assert len(s1) == len(s2) > return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))
I'm afraid that it isn't. Using Magnus Lie Hetland's implementation: http://hetland.org/coding/python/levenshtein.py In [1]: %cpaste Pasting code; enter '--' alone on the line to stop. :def levenshtein(a,b): : "Calculates the Levenshtein distance between a and b." : n, m = len(a), len(b) : if n > m: : # Make sure n <= m, to use O(min(n,m)) space : a,b = b,a : n,m = m,n : : current = range(n+1) : for i in range(1,m+1): : previous, current = current, [i]+[0]*n : for j in range(1,n+1): : add, delete = previous[j]+1, current[j-1]+1 : change = previous[j-1] : if a[j-1] != b[i-1]: : change = change + 1 : current[j] = min(add, delete, change) : : return current[n] :-- In [2]: In [3]: def hamdist(s1, s2): ...: assert len(s1) == len(s2) ...: return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2)) ...: In [4]: hamdist('abcdef', 'fabcde') Out[4]: 6 In [5]: levenshtein('abcdef', 'fabcde') Out[5]: 2 -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco -- http://mail.python.org/mailman/listinfo/python-list