Re: [Tutor] 1 to N searches in files
From: Dave Angel d...@davea.name
To: Spectral None spectraln...@yahoo.com.sg
Cc: tutor@python.org
Sent: Sunday, 2 December 2012, 20:05
Subject: Re: [Tutor] 1 to N searches in files

On 12/02/2012 03:53 AM, Spectral None wrote:

Hi all

I have two files (File A and File B) with strings of data in them (each string on a separate line). Basically, each string in File B will be compared with all the strings in File A and the resulting output is to show a list of matched/unmatched lines and optionally to write to a third File C

File A: Unique strings
File B: Can have duplicate strings (that is, string1 may appear more than once)

My code currently looks like this:

    FirstFile = open('C:\FileA.txt', 'r')
    SecondFile = open('C:\FileB.txt', 'r')
    ThirdFile = open('C:\FileC.txt', 'w')

    a = FirstFile.readlines()
    b = SecondFile.readlines()

    mydiff = difflib.Differ()
    results = mydiff(a,b)
    print('\n'.join(results))

    #ThirdFile.writelines(results)

    FirstFile.close()
    SecondFile.close()
    ThirdFile.close()

However, it seems that the results do not correctly reflect the matched/unmatched lines. As an example, if FileA contains string1 and FileB contains multiple occurrences of string1, it seems that the first occurrence matches correctly but subsequent string1s are treated as unmatched strings.

I am thinking perhaps I don't understand Differ() that well and that it is not doing what I hoped to do? Is Differ() comparing first line to first line and second line to second line etc in contrast to what I wanted to do?

Regards

Let me guess your goal, and then, on that assumption, discuss your code.

I think your File A is supposed to be a dictionary of valid words (strings). You want to process File B, checking each line against that dictionary, and make a list of which lines are valid (in the dictionary), and another of which lines are not (missing from the dictionary). That's one list for matched lines, and one for unmatched.

That isn't even close to what difflib does.
This can be solved with minimal code, but not by starting with difflib. What you should do is to loop through File A, adding all the lines to a set called valid_dictionary. Calling set(FirstFile) can do that in one line, without even calling readlines().

Then a simple loop can build the desired lists. The matched_lines is simply all lines which are in the dictionary, while unmatched_lines are those which are not.

The heart of the comparison could simply look like:

    if line in valid_dictionary:
        matched_lines.append(line)
    else:
        unmatched_lines.append(line)

--
DaveA

Hi Dave

Your solution seems to work:

    setA = set(FileA)
    setB = set(FileB)

    for line in setB:
        if line in setA:
            matched_lines.writelines(line)
        else:
            non_matched_lines.writelines(line)

There are no duplicates in the results as well.

Thanks for helping out

Regards

___
Tutor maillist - Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
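A minimal sketch of the file-writing variant reported above, assuming both input files end every line with a newline (otherwise a final line without one would not compare equal). The function name write_matches, the file names, and the choice to iterate in sorted order are illustrative assumptions, not part of the original posts:

```python
import os
import tempfile


def write_matches(path_a, path_b, matched_path, unmatched_path):
    """Write each distinct line of file B to one of two output files.

    Hypothetical helper: lines found in file A go to matched_path,
    all others go to unmatched_path.
    """
    with open(path_a) as file_a:
        set_a = set(file_a)      # iterating a file yields its lines
    with open(path_b) as file_b:
        set_b = set(file_b)      # duplicates in B collapse here

    with open(matched_path, "w") as matched, \
            open(unmatched_path, "w") as unmatched:
        # sorted() gives a repeatable order; a bare set has none.
        for line in sorted(set_b):
            if line in set_a:
                matched.write(line)
            else:
                unmatched.write(line)


# Small demonstration using temporary files with made-up contents.
with tempfile.TemporaryDirectory() as tmp:
    paths = {name: os.path.join(tmp, name + ".txt")
             for name in ("a", "b", "matched", "unmatched")}
    with open(paths["a"], "w") as f:
        f.write("spam\nham\neggs\ntomato\n")
    with open(paths["b"], "w") as f:
        f.write("tomato\nspam\neggs\ncheese\nspam\nspam\n")

    write_matches(paths["a"], paths["b"],
                  paths["matched"], paths["unmatched"])

    with open(paths["matched"]) as f:
        matched_text = f.read()
    with open(paths["unmatched"]) as f:
        unmatched_text = f.read()

print(matched_text, end="")
print(unmatched_text, end="")
```

The with blocks close every file even if an error occurs, which replaces the explicit close() calls in the original code.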
Re: [Tutor] 1 to N searches in files
multiple occurrences of string1, it seems that the first occurrence matches correctly but subsequent string1s are treated as unmatched strings.

I am thinking perhaps I don't understand Differ() that well and that it is not doing what I hoped to do? Is Differ() comparing first line to first line and second line to second line etc in contrast to what I wanted to do?

Regards

--
Message: 6
Date: Sun, 02 Dec 2012 20:34:24 +1100
From: Steven D'Aprano st...@pearwood.info
To: tutor@python.org
Subject: Re: [Tutor] 1 to N searches in files
Message-ID: 50bb20a0.5090...@pearwood.info
Content-Type: text/plain; charset=UTF-8; format=flowed

On 02/12/12 19:53, Spectral None wrote:

However, it seems that the results do not correctly reflect the matched/unmatched lines. As an example, if FileA contains string1 and FileB contains multiple occurrences of string1, it seems that the first occurrence matches correctly but subsequent string1s are treated as unmatched strings.

I am thinking perhaps I don't understand Differ() that well and that it is not doing what I hoped to do? Is Differ() comparing first line to first line and second line to second line etc in contrast to what I wanted to do?

No, and yes.

No, it is not comparing first line to first line.

And yes, it is acting in contrast to what you hope to do, otherwise you wouldn't be asking the question :-)

Unfortunately, you don't explain what it is that you hope to do, so I'm going to have to guess. See below.

difflib is used for finding differences between two files. It will try to find a set of changes which will turn file A into file B, e.g:

    insert this line here
    delete this line there
    ...

and repeated as many times as needed. Except that difflib.Differ uses a shorthand of + and - to indicate adding and deleting lines.

You can find out more about difflib and Differ objects by reading the Fine Manual. Open a Python interactive shell, and do this:

    import difflib
    help(difflib.Differ)

If you have any questions, please feel free to ask.

In the code sample you give, you say you do this:

    mydiff = difflib.Differ()
    results = mydiff(a,b)

but that doesn't work, Differ objects are not callable. Please do not paraphrase your code. Copy and paste the exact code you have actually run, don't try to type it out from memory.

Now, I *guess* that what you are trying to do is something like this... given files A and B:

    # file A
    spam
    ham
    eggs
    tomato

    # file B
    tomato
    spam
    eggs
    cheese
    spam
    spam

you want to generate three lists:

    # lines in B that were also in A:
    tomato
    spam
    eggs

    # lines in B that were not in A:
    cheese

    # lines in A that were not found in B:
    ham

Am I close? If not, please explain with an example what you are trying to do.

--
Steven

--

Hi Steven

I was searching for strings comparison and saw this article and decided to try it. There was no error when I ran the code (http://stackoverflow.com/questions/11008519/detecting-and-printing-the-difference-between-two-text-files-using-python-3-2)

In another reply by Dave about matching list of valid words, that is similar to what I want to do. I guess I probably misunderstood the usage of Differ().

Thanks for the help!

Regards

Hi Steven

My apologies as well for not being clear in my explanation.

Regards
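For reference, a short sketch of how a Differ object is normally used: the comparison goes through its compare() method, since the object itself is not callable. The two line lists below are made-up sample data, not the poster's files:

```python
import difflib

# Made-up sample data; each entry keeps its trailing newline,
# as readlines() would return it.
a = ["spam\n", "ham\n", "eggs\n"]
b = ["spam\n", "eggs\n", "cheese\n"]

differ = difflib.Differ()
# compare() returns a generator of lines, each prefixed with a
# two-character code: "  " unchanged, "- " only in a, "+ " only in b.
result = list(differ.compare(a, b))

print("".join(result), end="")
```

The output is a recipe for turning the first sequence into the second, which is why a repeated line in file B shows up as an addition rather than as another match.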
Re: [Tutor] 1 to N searches in files
On 12/03/2012 10:46 AM, Spectral None wrote:

<snip>

Hi Dave

Your solution seems to work:

    setA = set(FileA)
    setB = set(FileB)

    for line in setB:
        if line in setA:
            matched_lines.writelines(line)
        else:
            non_matched_lines.writelines(line)

There are no duplicates in the results as well.

Thanks for helping out

You didn't specify whether you wanted dups to be noticed. You had said that there were none in A, but B was unspecified. The other question is order. If you want original order, you'd have to omit the setB step, and iterate on FileB. And then eliminate dups as you go. Or if you want sorted order without dups, you could simply iterate on sorted(setB).

--
DaveA
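The two orderings described above can be sketched as follows. The helper name unique_in_order and the sample data are assumptions made for illustration, and the lines are shown already stripped of newlines:

```python
def unique_in_order(lines):
    """Return the lines in their original order, dropping repeats."""
    seen = set()
    result = []
    for line in lines:
        if line not in seen:   # first time this line has appeared
            seen.add(line)
            result.append(line)
    return result


# Made-up stand-ins for the contents of File A and File B.
valid = {"spam", "ham", "eggs", "tomato"}
file_b_lines = ["tomato", "spam", "eggs", "cheese", "spam", "spam"]

# Option 1: original order, eliminating duplicates as we go.
ordered = unique_in_order(file_b_lines)
matched = [line for line in ordered if line in valid]
unmatched = [line for line in ordered if line not in valid]

# Option 2: sorted order without duplicates.
sorted_unique = sorted(set(file_b_lines))

print(matched)
print(unmatched)
print(sorted_unique)
```

Iterating directly over a set, as in the quoted code, gives neither ordering: the order of a set is arbitrary.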
Re: [Tutor] 1 to N searches in files
On 02/12/12 19:53, Spectral None wrote:

However, it seems that the results do not correctly reflect the matched/unmatched lines. As an example, if FileA contains string1 and FileB contains multiple occurrences of string1, it seems that the first occurrence matches correctly but subsequent string1s are treated as unmatched strings.

I am thinking perhaps I don't understand Differ() that well and that it is not doing what I hoped to do? Is Differ() comparing first line to first line and second line to second line etc in contrast to what I wanted to do?

No, and yes.

No, it is not comparing first line to first line.

And yes, it is acting in contrast to what you hope to do, otherwise you wouldn't be asking the question :-)

Unfortunately, you don't explain what it is that you hope to do, so I'm going to have to guess. See below.

difflib is used for finding differences between two files. It will try to find a set of changes which will turn file A into file B, e.g:

    insert this line here
    delete this line there
    ...

and repeated as many times as needed. Except that difflib.Differ uses a shorthand of + and - to indicate adding and deleting lines.

You can find out more about difflib and Differ objects by reading the Fine Manual. Open a Python interactive shell, and do this:

    import difflib
    help(difflib.Differ)

If you have any questions, please feel free to ask.

In the code sample you give, you say you do this:

    mydiff = difflib.Differ()
    results = mydiff(a,b)

but that doesn't work, Differ objects are not callable. Please do not paraphrase your code. Copy and paste the exact code you have actually run, don't try to type it out from memory.

Now, I *guess* that what you are trying to do is something like this... given files A and B:

    # file A
    spam
    ham
    eggs
    tomato

    # file B
    tomato
    spam
    eggs
    cheese
    spam
    spam

you want to generate three lists:

    # lines in B that were also in A:
    tomato
    spam
    eggs

    # lines in B that were not in A:
    cheese

    # lines in A that were not found in B:
    ham

Am I close? If not, please explain with an example what you are trying to do.

--
Steven
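If the guess above is right and order does not matter, the three lists could be produced with plain set operations. This is only a sketch under that assumption, using the example data from the message with newlines already stripped; the variable names are illustrative:

```python
# Example data from the message above.
file_a = ["spam", "ham", "eggs", "tomato"]
file_b = ["tomato", "spam", "eggs", "cheese", "spam", "spam"]

set_a = set(file_a)
set_b = set(file_b)

# sorted() is used only to make the output order predictable.
in_both = sorted(set_b & set_a)   # lines in B that were also in A
only_in_b = sorted(set_b - set_a) # lines in B that were not in A
only_in_a = sorted(set_a - set_b) # lines in A that were not found in B

print(in_both)
print(only_in_b)
print(only_in_a)
```

Because sets discard repeats, the three occurrences of spam in file B count once.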
Re: [Tutor] 1 to N searches in files
On 12/02/2012 03:53 AM, Spectral None wrote:

Hi all

I have two files (File A and File B) with strings of data in them (each string on a separate line). Basically, each string in File B will be compared with all the strings in File A and the resulting output is to show a list of matched/unmatched lines and optionally to write to a third File C

File A: Unique strings
File B: Can have duplicate strings (that is, string1 may appear more than once)

My code currently looks like this:

    FirstFile = open('C:\FileA.txt', 'r')
    SecondFile = open('C:\FileB.txt', 'r')
    ThirdFile = open('C:\FileC.txt', 'w')

    a = FirstFile.readlines()
    b = SecondFile.readlines()

    mydiff = difflib.Differ()
    results = mydiff(a,b)
    print('\n'.join(results))

    #ThirdFile.writelines(results)

    FirstFile.close()
    SecondFile.close()
    ThirdFile.close()

However, it seems that the results do not correctly reflect the matched/unmatched lines. As an example, if FileA contains string1 and FileB contains multiple occurrences of string1, it seems that the first occurrence matches correctly but subsequent string1s are treated as unmatched strings.

I am thinking perhaps I don't understand Differ() that well and that it is not doing what I hoped to do? Is Differ() comparing first line to first line and second line to second line etc in contrast to what I wanted to do?

Regards

Let me guess your goal, and then, on that assumption, discuss your code.

I think your File A is supposed to be a dictionary of valid words (strings). You want to process File B, checking each line against that dictionary, and make a list of which lines are valid (in the dictionary), and another of which lines are not (missing from the dictionary). That's one list for matched lines, and one for unmatched.

That isn't even close to what difflib does.

This can be solved with minimal code, but not by starting with difflib. What you should do is to loop through File A, adding all the lines to a set called valid_dictionary. Calling set(FirstFile) can do that in one line, without even calling readlines().

Then a simple loop can build the desired lists. The matched_lines is simply all lines which are in the dictionary, while unmatched_lines are those which are not.

The heart of the comparison could simply look like:

    if line in valid_dictionary:
        matched_lines.append(line)
    else:
        unmatched_lines.append(line)

--
DaveA
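Put together, the approach described above might look like the sketch below. It assumes in-memory stand-ins (io.StringIO) for the two files so that it runs anywhere, and it strips trailing newlines, which the original description does not mention, so that a last line without a newline still matches. The function name split_matches is hypothetical:

```python
import io


def split_matches(valid_file, candidate_file):
    """Return (matched_lines, unmatched_lines) for the candidate file."""
    # Build the set of valid strings; iterating a file yields its lines.
    valid_dictionary = {line.rstrip("\n") for line in valid_file}

    matched_lines = []
    unmatched_lines = []
    for line in candidate_file:
        line = line.rstrip("\n")
        if line in valid_dictionary:   # set lookup, not a scan of File A
            matched_lines.append(line)
        else:
            unmatched_lines.append(line)
    return matched_lines, unmatched_lines


# Stand-ins for File A and File B; real code would use open() instead.
first_file = io.StringIO("spam\nham\neggs\ntomato\n")
second_file = io.StringIO("tomato\nspam\neggs\ncheese\nspam\nspam")

matched, unmatched = split_matches(first_file, second_file)
print(matched)
print(unmatched)
```

Unlike the set-of-B variant, this keeps every occurrence from File B, so a string that appears three times there is reported as matched three times.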