On Tue, Nov 27, 2012 at 4:05 PM, Neil Cerutti <ne...@norwich.edu> wrote: > On 2012-11-27, Anatoli Hristov <toli...@gmail.com> wrote: >> Thanks for your help. I will do my best for the forum :) >> >> I advanced a little bit with the algorithm and at least I can >> now extract and compare the fields :) For my beginner skills I >> think this is too much for me. Now next step is to add the >> second field with the number to the Namelist and copy it to a >> third filename I suppose. > > I had to write a similar type of program, and I imagine it's a > common problem. Sometimes new students provide incorrect SSN's or > simply leave them blank. This makes it impossible for us to match > their application for financial aid to their admissions record. > > You have to analyze how you're going to match records. > > In my case, missing SSN's are one case. A likeley match in this > case is when the names are eerily similar. > > In the other case, where they simply got their SSN wrong, I have > to check for both a similar SSN and a similar name. > > But you still have to define "similar." I looked up an algorithm > on the web called Levenshtein Distance, and implemented it like > so. > > def levenshteindistance(first, second): > """Find the Levenshtein distance between two strings.""" > if len(first) > len(second): > first, second = second, first > if len(second) == 0: > return len(first) > first_length = len(first) + 1 > second_length = len(second) + 1 > distance_matrix = [[0] * second_length for x in range(first_length)] > for i in range(first_length): > distance_matrix[i][0] = i > for j in range(second_length): > distance_matrix[0][j]=j > for i in range(1, first_length): > for j in range(1, second_length): > deletion = distance_matrix[i-1][j] + 1 > insertion = distance_matrix[i][j-1] + 1 > substitution = distance_matrix[i-1][j-1] > if first[i-1] != second[j-1]: > substitution += 1 > distance_matrix[i][j] = min(insertion, deletion, substitution) > return distance_matrix[first_length-1][second_length-1] > > The algorithm return a count of every difference between the two > strings, from 0 to the length of the longest string. > > Python provides difflib, which implements a similar algorithm, so > I used that as well (kinda awkwardly). I used > difflib.get_close_matches to get candidates, and then > difflib.SequenceMatcher to provide me a score measuring the > closeness. > > matches = difflib.get_close_matches(s1, s2) > for m in matches: > scorer = difflib.SequenceMatcher(None, s1, m) > ratio = scorer.ratio() > if ratio == 0.0: > # perfect match > if ratio > MAX_RATIO: # You gotta choose this. I used 0.1 > # close match > > The two algorithms come up with different guesses, and I pass on > their suggestions for fixes to a human being. Both versions of > the program take roughly 5 minutes to run the comparison on > 2000-12000 records between the two files. > > I like the results of Levenshtein distance a little better, but > difflib finds some stuff that it misses. > > In your case, the name is munged horribly in one of the files so > you'll first have to first sort it out somehow. > > -- > Neil Cerutti > -- > http://mail.python.org/mailman/listinfo/python-list
Thank you all for the help, but I figured that out and the program now works perfect. I would appreciate if you have some notes about my script as I'm noob :) Here is the code: import csv origf = open('c:/Working/Test_phonebook.csv', 'rt') secfile = open('c:/Working/phones.csv', 'rt') phonelist = [] namelist = [] names = csv.reader(origf, delimiter=';') phones = csv.reader(secfile, delimiter=';') for tel in phones: phonelist.append(tel) def finder(name_row,rows): for ex_phone in phonelist: telstr = ex_phone[0].lower() if telstr.find(name_row) >= 0: print "\nName found: %s" % name_row namelist[rows][-1] = ex_phone[-1].lower() else: pass return def name_find(): rows = 0 for row in names: namelist.append(row) name_row = row[0].lower() finder(name_row,rows) rows = rows+1 name_find() ofile = open('c:/Working/ttest.csv', "wb") writer = csv.writer(wfile, delimiter=';') for insert in namelist: writer.writerow(insert) wfile.close() -- http://mail.python.org/mailman/listinfo/python-list