I have some CDs and have been archiving them on a PC. I wrote a Python script that spans the archive and returns a list of its contents: [[genre, artist, album, song]...]. I wanted to add a search function to locate all the versions of a particular song. This is harder than you might think. For example the Cajun "national anthem" is Jolie Blond, but it can be spelled several different ways jolie, joli, blon, blond, etc... In addition the various online services that provide song info are riddled with typos, so an ordinary string match just doesn't get you there. What is needed is a fuzzy string match and it turns out that there is a very good one, the Levenshtein distance, which is the number of inserts, deletions and substitutions needed to morph one string into another. In my application I match the desired song title against all song titles in my list and return the ones with the lowest Levenshtein distances. This is remarkably, one might even say stunningly, effective, easily finding all the version of Jolie Blon in the list.
I am using the following snippet (found on the web, proper attribution unsure), which calculates the Levenshtein distance. def distance(a,b): c = {} n = len(a); m = len(b) for i in range(0,n+1): c[i,0] = i for j in range(0,m+1): c[0,j] = j for i in range(1,n+1): for j in range(1,m+1): x = c[i-1,j]+1 y = c[i,j-1]+1 if a[i-1] == b[j-1]: z = c[i-1,j-1] else: z = c[i-1,j-1]+1 c[i,j] = min(x,y,z) return c[n,m] As mentioned above this works quite well and I am happy with it, but I wonder if there is a more Pythonic way of doing this type of lookup? jab -- http://mail.python.org/mailman/listinfo/python-list