Ok, ok, I got it! The Pythonic way is to use an existing library ;-) import difflib CloseMatches=difflib.get_close_matches(AFileName,AllFiles,20,.7)
I wrote a script to delete duplicate mp3's by filename a few years back with this. If anyone's interested in seeing it, I'll post a blog entry on it. I'm betting it uses a similiar algorithm your functions. -Greg On 30 Jan 2006 17:28:47 -0800, ajones <[EMAIL PROTECTED]> wrote: > > BBands wrote: > > I have some CDs and have been archiving them on a PC. I wrote a Python > > script that spans the archive and returns a list of its contents: > > [[genre, artist, album, song]...]. I wanted to add a search function to > > locate all the versions of a particular song. This is harder than you > > might think. For example the Cajun "national anthem" is Jolie Blond, > > but it can be spelled several different ways jolie, joli, blon, blond, > > etc... In addition the various online services that provide song info > > are riddled with typos, so an ordinary string match just doesn't get > > you there. What is needed is a fuzzy string match and it turns out that > > there is a very good one, the Levenshtein distance, which is the number > > of inserts, deletions and substitutions needed to morph one string into > > another. In my application I match the desired song title against all > > song titles in my list and return the ones with the lowest Levenshtein > > distances. This is remarkably, one might even say stunningly, > > effective, easily finding all the version of Jolie Blon in the list. > > > > I am using the following snippet (found on the web, proper attribution > > unsure), which calculates the Levenshtein distance. > > > > def distance(a,b): > > c = {} > > n = len(a); m = len(b) > > > > for i in range(0,n+1): > > c[i,0] = i > > for j in range(0,m+1): > > c[0,j] = j > > > > for i in range(1,n+1): > > for j in range(1,m+1): > > x = c[i-1,j]+1 > > y = c[i,j-1]+1 > > if a[i-1] == b[j-1]: > > z = c[i-1,j-1] > > else: > > z = c[i-1,j-1]+1 > > c[i,j] = min(x,y,z) > > return c[n,m] > > > > As mentioned above this works quite well and I am happy with it, but I > > wonder if there is a more Pythonic way of doing this type of lookup? > > > > jab > > Here is my stab at it, didn't fully test it so it may not work > correctly. The tuples could be avoided by using len(x), but the extra > lookups may cost in execution speed[1]. > > def distance(a, b): > """ > Computes the levenshtein distance between two strings > """ > m, n = (len(a),a), (len(b),b) > if(m[0] < n[0]): #ensure that the 'm' tuple holds > the longest string > m, n = n, m > dist = m[0] #assume distance = length of > longest string (worst case) > for i in range(0, n[0]): # reduce the distance for each char > match in shorter string > if m[1][i] == n[1][i]: > dist = dist - 1 > return dist > > [1] I have no if this is true or not. I can see arguments for both. > > -- > http://mail.python.org/mailman/listinfo/python-list > -- Gregory Piñero Chief Innovation Officer Blended Technologies (www.blendedtechnologies.com) -- http://mail.python.org/mailman/listinfo/python-list