From: Dave Angel <d...@davea.name>
To: Spectral None <spectraln...@yahoo.com.sg> 
Cc: "tutor@python.org" <tutor@python.org> 
Sent: Sunday, 2 December 2012, 20:05
Subject: Re: [Tutor] 1 to N searches in files

On 12/02/2012 03:53 AM, Spectral None wrote:
> Hi all
>
> I have two files (File A and File B) with strings of data in them (each 
> string on a separate line). Basically, each string in File B will be compared 
> with all the strings in File A and the resulting output is to show a list of 
> matched/unmatched lines and optionally to write to a third File C
>
> File A: Unique strings
> File B: Can have duplicate strings (that is, "string1" may appear more than 
> once)
>
> My code currently looks like this:
>
> -----------------
> import difflib
>
> FirstFile = open(r'C:\FileA.txt', 'r')
> SecondFile = open(r'C:\FileB.txt', 'r')
> ThirdFile = open(r'C:\FileC.txt', 'w')
>
> a = FirstFile.readlines()
> b = SecondFile.readlines()
>
> mydiff = difflib.Differ()
> results = mydiff.compare(a, b)
> print("\n".join(results))
>
> #ThirdFile.writelines(results)
>
> FirstFile.close()
> SecondFile.close()
> ThirdFile.close()
> ---------------------
>
> However, it seems that the results do not correctly reflect the 
> matched/unmatched lines. As an example, if FileA contains "string1" and FileB 
> contains multiple occurrences of "string1", it seems that the first 
> occurrence matches correctly but subsequent "string1"s are treated as 
> unmatched strings.
>
> I am thinking perhaps I don't understand Differ() that well and that it is 
> not doing what I hoped to do? Is Differ() comparing first line to first line 
> and second line to second line etc in contrast to what I wanted to do?
>
> Regards
>
>
> Let me guess your goal, and then, on that assumption, discuss your code.


> I think your File A is supposed to be a dictionary of valid words
> (strings).  You want to process File B, checking each line against that
> dictionary, and make a list of which lines are "valid" (in the
> dictionary), and another of which lines are not (missing from the
> dictionary).  That's one list for matched lines, and one for unmatched.

> That isn't even close to what difflib does.  This can be solved with
> minimal code, but not by starting with difflib.
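
To illustrate the point about difflib, here is a minimal sketch using made-up sample data (the two lists below are assumptions standing in for File A and File B). Differ aligns its two inputs as ordered sequences, so a single "string1" in the first list can pair with only one "string1" in the second; the extra copy is reported as an added line.

```python
import difflib

# Hypothetical sample data standing in for File A and File B.
a = ["string1\n"]
b = ["string1\n", "string1\n"]

# Differ.compare() yields each line with a two-character prefix:
# "  " for lines common to both, "- " only in a, "+ " only in b.
result = list(difflib.Differ().compare(a, b))

for entry in result:
    print(entry, end="")
```

One copy comes back marked as common and the other marked with "+ ", which matches the behaviour described in the original question.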

> What you should do is to loop through File A, adding all the lines to a
> set called valid_dictionary.  Calling set(FirstFile) can do that in one
> line, without even calling readlines().
> Then a simple loop can build the desired lists: matched_lines holds all
> lines which are in the dictionary, while unmatched_lines holds those
> which are not.

> The heart of the comparison could simply look like:

>     if line in valid_dictionary:
>         matched_lines.append(line)
>     else:
>         unmatched_lines.append(line)
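
Put together, the approach described above might look like the sketch below. The function name split_matches and the sample lists are hypothetical, and stripping the trailing newline is an added assumption (not part of the original advice) so that a final line without a newline still matches.

```python
def split_matches(valid_lines, candidate_lines):
    """Split candidate_lines into (matched, unmatched) against valid_lines."""
    # Build the lookup set once; membership tests on a set are fast.
    valid_dictionary = {line.rstrip("\n") for line in valid_lines}

    matched_lines = []
    unmatched_lines = []
    for line in candidate_lines:
        text = line.rstrip("\n")
        if text in valid_dictionary:
            matched_lines.append(text)
        else:
            unmatched_lines.append(text)
    return matched_lines, unmatched_lines


# Hypothetical sample data; in practice these would be open file objects.
file_a = ["string1\n", "string2\n", "string3"]
file_b = ["string1\n", "string4\n", "string1\n", "string3\n"]

matched, unmatched = split_matches(file_a, file_b)
print("matched:", matched)
print("unmatched:", unmatched)
```

Because the lists are built by appending, duplicates in File B are kept and appear in their original order.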


> -- 

> DaveA

---------------------

Hi Dave

Your solution seems to work:

setA = set(FileA)
setB = set(FileB)

for line in setB:
  if line in setA:
    matched_lines.write(line)
  else:
    non_matched_lines.write(line)

There are no duplicates in the results either. Thanks for helping out.
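
The duplicates disappear because set(FileB) keeps only one copy of each line; it also discards the original line order. If the order of File B matters, one possible variation (an assumption, not something discussed in this thread) is dict.fromkeys, which drops repeats while keeping first-seen order on Python 3.7 and later. The sample data below is made up.

```python
# Hypothetical sample data standing in for the lines of File A and File B.
valid = {"string1\n", "string2\n"}
lines_b = ["string2\n", "string1\n", "string2\n", "string4\n"]

# dict.fromkeys() keeps the first occurrence of each line, in order.
unique_in_order = list(dict.fromkeys(lines_b))

matched = [line for line in unique_in_order if line in valid]
non_matched = [line for line in unique_in_order if line not in valid]

print("matched:", matched)
print("non-matched:", non_matched)
```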

Regards
_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
