On 2012-11-27, Anatoli Hristov <toli...@gmail.com> wrote: > Thanks for your help. I will do my best for the forum :) > > I advanced a little bit with the algorithm and at least I can > now extract and compare the fields :) For my beginner skills I > think this is too much for me. Now next step is to add the > second field with the number to the Namelist and copy it to a > third filename I suppose.
I had to write a similar type of program, and I imagine it's a common problem. Sometimes new students provide incorrect SSN's or simply leave them blank. This makes it impossible for us to match their application for financial aid to their admissions record. You have to analyze how you're going to match records. In my case, missing SSN's are one case. A likeley match in this case is when the names are eerily similar. In the other case, where they simply got their SSN wrong, I have to check for both a similar SSN and a similar name. But you still have to define "similar." I looked up an algorithm on the web called Levenshtein Distance, and implemented it like so. def levenshteindistance(first, second): """Find the Levenshtein distance between two strings.""" if len(first) > len(second): first, second = second, first if len(second) == 0: return len(first) first_length = len(first) + 1 second_length = len(second) + 1 distance_matrix = [[0] * second_length for x in range(first_length)] for i in range(first_length): distance_matrix[i][0] = i for j in range(second_length): distance_matrix[0][j]=j for i in range(1, first_length): for j in range(1, second_length): deletion = distance_matrix[i-1][j] + 1 insertion = distance_matrix[i][j-1] + 1 substitution = distance_matrix[i-1][j-1] if first[i-1] != second[j-1]: substitution += 1 distance_matrix[i][j] = min(insertion, deletion, substitution) return distance_matrix[first_length-1][second_length-1] The algorithm return a count of every difference between the two strings, from 0 to the length of the longest string. Python provides difflib, which implements a similar algorithm, so I used that as well (kinda awkwardly). I used difflib.get_close_matches to get candidates, and then difflib.SequenceMatcher to provide me a score measuring the closeness. matches = difflib.get_close_matches(s1, s2) for m in matches: scorer = difflib.SequenceMatcher(None, s1, m) ratio = scorer.ratio() if ratio == 0.0: # perfect match if ratio > MAX_RATIO: # You gotta choose this. I used 0.1 # close match The two algorithms come up with different guesses, and I pass on their suggestions for fixes to a human being. Both versions of the program take roughly 5 minutes to run the comparison on 2000-12000 records between the two files. I like the results of Levenshtein distance a little better, but difflib finds some stuff that it misses. In your case, the name is munged horribly in one of the files so you'll first have to first sort it out somehow. -- Neil Cerutti -- http://mail.python.org/mailman/listinfo/python-list