From: Dave Angel <d...@davea.name>
To: Spectral None <spectraln...@yahoo.com.sg>
Cc: "tutor@python.org" <tutor@python.org>
Sent: Sunday, 2 December 2012, 20:05
Subject: Re: [Tutor] 1 to N searches in files
On 12/02/2012 03:53 AM, Spectral None wrote:
> Hi all
>
> I have two files (File A and File B) with strings of data in them (each
> string on a separate line). Basically, each string in File B will be compared
> with all the strings in File A and the resulting output is to show a list of
> matched/unmatched lines and optionally to write to a third File C
>
> File A: Unique strings
> File B: Can have duplicate strings (that is, "string1" may appear more than
> once)
>
> My code currently looks like this:
>
> -----------------
> import difflib
>
> FirstFile = open(r'C:\FileA.txt', 'r')
> SecondFile = open(r'C:\FileB.txt', 'r')
> ThirdFile = open(r'C:\FileC.txt', 'w')
>
> a = FirstFile.readlines()
> b = SecondFile.readlines()
>
> mydiff = difflib.Differ()
> results = mydiff.compare(a, b)  # a Differ object is not callable; compare() yields the delta lines
> print("\n".join(results))
>
> #ThirdFile.writelines(results)
>
> FirstFile.close()
> SecondFile.close()
> ThirdFile.close()
> ---------------------
>
> However, it seems that the results do not correctly reflect the
> matched/unmatched lines. As an example, if FileA contains "string1" and FileB
> contains multiple occurrences of "string1", it seems that the first
> occurrence matches correctly but subsequent "string1"s are treated as
> unmatched strings.
>
> I am thinking perhaps I don't understand Differ() that well and that it is
> not doing what I hoped to do? Is Differ() comparing first line to first line
> and second line to second line etc in contrast to what I wanted to do?
>
> Regards
>
>
> Let me guess your goal, and then, on that assumption, discuss your code.
> I think your File A is supposed to be a dictionary of valid words
> (strings). You want to process File B, checking each line against that
> dictionary, and make a list of which lines are "valid" (in the
> dictionary), and another of which lines are not (missing from the
> dictionary). That's one list for matched lines, and one for unmatched.
> That isn't even close to what difflib does. This can be solved with
> minimal code, but not by starting with difflib.
> What you should do is to loop through File A, adding all the lines to a
> set called valid_dictionary. Calling set(FirstFile) can do that in one
> line, without even calling readlines().
> Then a simple loop over File B can build the desired lists. matched_lines
> is simply all the lines of File B that are in the dictionary, while
> unmatched_lines holds those that are not.
> The heart of the comparison could simply look like:
>     if line in valid_dictionary:
>         matched_lines.append(line)
>     else:
>         unmatched_lines.append(line)
> --
> DaveA
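
To illustrate the difflib point in the reply above: Differ.compare() aligns two sequences the way a text diff does, so a line that appears once in the first sequence and twice in the second is reported once as common and once as added. A minimal sketch, with made-up in-memory lines standing in for File A and File B (the sample strings are assumptions, not the poster's data):

```python
import difflib

# Hypothetical stand-ins for the contents of File A and File B.
file_a_lines = ["string1\n", "string2\n"]
file_b_lines = ["string1\n", "string1\n", "string2\n"]

# compare() returns a generator of delta lines, each prefixed with a
# two-character code: "  " (common), "- " (only in a), "+ " (only in b).
delta = list(difflib.Differ().compare(file_a_lines, file_b_lines))

common = [d for d in delta if d.startswith("  ")]
added = [d for d in delta if d.startswith("+ ")]
removed = [d for d in delta if d.startswith("- ")]

print("".join(delta), end="")
print(len(common), "common,", len(added), "added,", len(removed), "removed")
```

One of the two "string1" lines in the second list comes out marked as added, which matches the behaviour described in the original question.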
---------------------
Hi Dave
Your solution seems to work:
setA = set(FileA)
setB = set(FileB)
for line in setB:
    if line in setA:
        matched_lines.writelines(line)
    else:
        non_matched_lines.writelines(line)
There are no duplicates in the results either. Thanks for helping out.
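
For reference, a self-contained sketch of the set-based approach. The function name split_by_membership, the unique flag and the sample data are hypothetical, and io.StringIO stands in for the real files. It strips line endings before comparing (an assumption about the data: a final line without a trailing newline would otherwise fail to match), and it walks File B in order instead of going through set(FileB), so the results keep their first-seen order:

```python
import io


def split_by_membership(valid_file, candidate_file, unique=True):
    """Split candidate lines into (matched, unmatched) against valid_file."""
    # Build the lookup set once; membership tests on a set are fast.
    valid = {line.rstrip("\n") for line in valid_file}
    matched, unmatched = [], []
    seen = set()
    for line in candidate_file:
        line = line.rstrip("\n")
        if unique and line in seen:
            continue  # skip repeated lines when duplicates are not wanted
        seen.add(line)
        target = matched if line in valid else unmatched
        target.append(line)
    return matched, unmatched


# Made-up contents; note the last line of file_b has no trailing newline.
file_a = io.StringIO("string1\nstring2\nstring3\n")
file_b = io.StringIO("string1\nstring9\nstring1\nstring2")

matched, unmatched = split_by_membership(file_a, file_b)
print(matched)    # ['string1', 'string2']
print(unmatched)  # ['string9']
```

Passing unique=False keeps every occurrence from File B, duplicates included.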
Regards