Re: [Tutor] 1 to N searches in files

2012-12-03 Thread Spectral None
From: Dave Angel d...@davea.name
To: Spectral None spectraln...@yahoo.com.sg 
Cc: tutor@python.org tutor@python.org 
Sent: Sunday, 2 December 2012, 20:05
Subject: Re: [Tutor] 1 to N searches in files

On 12/02/2012 03:53 AM, Spectral None wrote:
 Hi all

 I have two files (File A and File B) with strings of data in them (each 
 string on a separate line). Basically, each string in File B will be compared 
 with all the strings in File A and the resulting output is to show a list of 
 matched/unmatched lines and optionally to write to a third File C

 File A: Unique strings
 File B: Can have duplicate strings (that is, string1 may appear more than 
 once)

 My code currently looks like this:

 -
 FirstFile = open('C:\FileA.txt', 'r')
 SecondFile = open('C:\FileB.txt', 'r')
 ThirdFile = open('C:\FileC.txt', 'w')

 a = FirstFile.readlines()
 b = SecondFile.readlines()

 mydiff = difflib.Differ()
 results = mydiff(a,b)
 print('\n'.join(results))

 #ThirdFile.writelines(results)

 FirstFile.close()
 SecondFile.close()
 ThirdFile.close()
 -

 However, it seems that the results do not correctly reflect the 
 matched/unmatched lines. As an example, if FileA contains string1 and FileB 
 contains multiple occurrences of string1, it seems that the first 
 occurrence matches correctly but subsequent string1s are treated as 
 unmatched strings.

 I am thinking perhaps I don't understand Differ() that well and that it is 
 not doing what I hoped to do? Is Differ() comparing first line to first line 
 and second line to second line etc in contrast to what I wanted to do?

 Regards


 Let me guess your goal, and then, on that assumption, discuss your code.


 I think your File A is supposed to be a dictionary of valid words
 (strings).  You want to process File B, checking each line against that
 dictionary, and make a list of which lines are valid (in the
 dictionary), and another of which lines are not (missing from the
 dictionary).  That's one list for matched lines, and one for unmatched.

 That isn't even close to what difflib does.  This can be solved with
 minimal code, but not by starting with difflib.

 What you should do is to loop through File A, adding all the lines to a
 set called valid_dictionary.  Calling set(FirstFile) can do that in one
 line, without even calling readlines().
 Then a simple loop can build the desired lists.  matched_lines is
 simply all the lines which are in the dictionary, while unmatched_lines
 holds those which are not.

 The heart of the comparison could simply look like:

     if line in valid_dictionary:
         matched_lines.append(line)
     else:
         unmatched_lines.append(line)


 -- 

 DaveA

-

Hi Dave

Your solution seems to work:

setA = set(FileA)
setB = set(FileB)

for line in setB:
  if line in setA:
    matched_lines.writelines(line)
  else:
    non_matched_lines.writelines(line)

There are no duplicates in the results as well. Thanks for helping out

Regards
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] 1 to N searches in files

2012-12-03 Thread Spectral None

Message: 6
Date: Sun, 02 Dec 2012 20:34:24 +1100
From: Steven D'Aprano st...@pearwood.info
To: tutor@python.org
Subject: Re: [Tutor] 1 to N searches in files
Message-ID: 50bb20a0.5090...@pearwood.info
Content-Type: text/plain; charset=UTF-8; format=flowed

On 02/12/12 19:53, Spectral None wrote:

 However, it seems that the results do not correctly reflect the
 matched/unmatched lines. As an example, if FileA contains string1
 and FileB contains multiple occurrences of string1, it seems that
 the first occurrence matches correctly but subsequent string1s
 are treated as unmatched strings.

 I am thinking perhaps I don't understand Differ() that well and that
 it is not doing what I hoped to do? Is Differ() comparing first line
 to first line and second line to second line etc in contrast to what
 I wanted to do?

No, and yes.

No, it is not comparing first line to first line.

And yes, it is acting in contrast to what you hope to do, otherwise you
wouldn't be asking the question :-)

Unfortunately, you don't explain what it is that you hope to do, so I'm
going to have to guess. See below.

difflib is used for finding differences between two files. It will try to
find a set of changes which will turn file A into file B, e.g:

insert this line here
delete this line there
...


and repeated as many times as needed. Except that difflib.Differ uses
a shorthand of + and - to indicate adding and deleting lines.

You can find out more about difflib and Differ objects by reading the
Fine Manual. Open a Python interactive shell, and do this:

import difflib
help(difflib.Differ)


If you have any questions, please feel free to ask.

In the code sample you give, you say you do this:

mydiff = difflib.Differ()
results = mydiff(a,b)

but that doesn't work: Differ objects are not callable. Please do not
paraphrase your code. Copy and paste the exact code you have actually
run, don't try to type it out from memory.

Now, I *guess* that what you are trying to do is something like this...
given files A and B:


# file A
spam
ham
eggs
tomato


# file B
tomato
spam
eggs
cheese
spam
spam


you want to generate three lists:

# lines in B that were also in A:
tomato
spam
eggs


# lines in B that were not in A:
cheese


# lines in A that were not found in B:
ham


Am I close?

If not, please explain with an example what you are trying
to do.


-- 
Steven


--

 Hi Steven

 I was searching for strings comparison and saw this article and decided to 
 try it. There was no error when I ran the code
 (http://stackoverflow.com/questions/11008519/detecting-and-printing-the-difference-between-two-text-files-using-python-3-2)

 In another reply by Dave about matching list of valid words, that is similar 
 to what I want to do. I guess I probably misunderstood the usage of Differ(). 
 Thanks for the help!

 Regards

Hi Steven

My apologies as well for not being clear in my explanation.

Regards


Re: [Tutor] 1 to N searches in files

2012-12-03 Thread Dave Angel
On 12/03/2012 10:46 AM, Spectral None wrote:
 snip
 Hi Dave
 
 Your solution seems to work:
 
 setA = set(FileA)
 setB = set(FileB)
 
 for line in setB:
   if line in setA:
     matched_lines.writelines(line)
   else:
     non_matched_lines.writelines(line)
 
 There are no duplicates in the results as well. Thanks for helping out
 

You didn't specify whether you wanted dups to be noticed.  You had said
that there were none in A, but B was unspecified.

The other question is order.  If you want original order, you'd have to
omit the setB step, and iterate on FileB.  And then eliminate dups as
you go.

Or if you want sorted order without dups, you could simply iterate on
sorted(setB).
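
A rough sketch of both options, using made-up sample lists in place of
the real files (lines_a, lines_b and the result names are only
illustrative, not taken from the original code):

```python
# Made-up sample data standing in for the lines of File A and File B.
lines_a = ["alpha", "beta", "gamma"]
lines_b = ["gamma", "delta", "alpha", "gamma", "delta"]
set_a = set(lines_a)

# Option 1: keep the original order of File B, dropping dups as we go.
seen = set()
matched_in_order = []
unmatched_in_order = []
for line in lines_b:
    if line in seen:
        continue  # already handled an earlier copy of this line
    seen.add(line)
    if line in set_a:
        matched_in_order.append(line)
    else:
        unmatched_in_order.append(line)

# Option 2: sorted order without dups.
matched_sorted = [line for line in sorted(set(lines_b)) if line in set_a]

print(matched_in_order)
print(unmatched_in_order)
print(matched_sorted)
```

If duplicates in File B should be reported rather than dropped, the
"seen" check can simply be left out of the first loop.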


-- 

DaveA


Re: [Tutor] 1 to N searches in files

2012-12-02 Thread Steven D'Aprano

On 02/12/12 19:53, Spectral None wrote:


 However, it seems that the results do not correctly reflect the
 matched/unmatched lines. As an example, if FileA contains string1
 and FileB contains multiple occurrences of string1, it seems that
 the first occurrence matches correctly but subsequent string1s
 are treated as unmatched strings.

 I am thinking perhaps I don't understand Differ() that well and that
 it is not doing what I hoped to do? Is Differ() comparing first line
 to first line and second line to second line etc in contrast to what
 I wanted to do?


No, and yes.

No, it is not comparing first line to first line.

And yes, it is acting in contrast to what you hope to do, otherwise you
wouldn't be asking the question :-)

Unfortunately, you don't explain what it is that you hope to do, so I'm
going to have to guess. See below.

difflib is used for finding differences between two files. It will try to
find a set of changes which will turn file A into file B, e.g:

insert this line here
delete this line there
...


and repeated as many times as needed. Except that difflib.Differ uses
a shorthand of + and - to indicate adding and deleting lines.
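
To make the + and - shorthand concrete, here is a minimal sketch. The
two lists are made-up sample data, not the poster's files; the only
library call used is the compare() method of difflib.Differ:

```python
import difflib

# Made-up sample data standing in for the lines of two files.
a = ["spam\n", "ham\n", "eggs\n"]
b = ["spam\n", "eggs\n", "cheese\n"]

# compare() yields one string per line, each with a two-character prefix:
# "  " = unchanged, "- " = only in a, "+ " = only in b.
# (Differ can also emit "? " hint lines when two lines are merely similar.)
diff = list(difflib.Differ().compare(a, b))

removed = [line[2:].strip() for line in diff if line.startswith("- ")]
added = [line[2:].strip() for line in diff if line.startswith("+ ")]

print("".join(diff), end="")
print("removed:", removed)
print("added:", added)
```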

You can find out more about difflib and Differ objects by reading the
Fine Manual. Open a Python interactive shell, and do this:

import difflib
help(difflib.Differ)


If you have any questions, please feel free to ask.

In the code sample you give, you say you do this:

mydiff = difflib.Differ()
results = mydiff(a,b)

but that doesn't work: Differ objects are not callable. Please do not
paraphrase your code. Copy and paste the exact code you have actually
run, don't try to type it out from memory.
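
For what it's worth, here is a sketch of what the call presumably
needed to look like, using the compare() method. The sample lists are
an assumption modelled on the case described in the question (one
string1 in A, several in B), not the real file contents:

```python
import difflib

# Assumed sample data: A holds string1 once, B holds it three times.
a = ["string1\n"]
b = ["string1\n", "string1\n", "string1\n"]

mydiff = difflib.Differ()
print(callable(mydiff))  # a Differ instance cannot be called like a function

# The comparison itself is done by the compare() method.
results = list(mydiff.compare(a, b))
print("".join(results), end="")
```

A diff pairs each line of A with at most one line of B, so the extra
copies of string1 are reported as insertions ("+ ") rather than as
matches. That would explain the behaviour described in the question.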

Now, I *guess* that what you are trying to do is something like this...
given files A and B:


# file A
spam
ham
eggs
tomato


# file B
tomato
spam
eggs
cheese
spam
spam


you want to generate three lists:

# lines in B that were also in A:
tomato
spam
eggs


# lines in B that were not in A:
cheese


# lines in A that were not found in B:
ham
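
If that guess is right, the three lists can be built with sets rather
than difflib. A minimal sketch, with the example data above typed in as
lists (the variable names are only illustrative):

```python
# The example files from above, one list item per line.
file_a = ["spam", "ham", "eggs", "tomato"]
file_b = ["tomato", "spam", "eggs", "cheese", "spam", "spam"]

set_a = set(file_a)
set_b = set(file_b)

# dict.fromkeys() drops duplicates while keeping first-seen order
# (guaranteed for dicts from Python 3.7 onwards).
unique_b = list(dict.fromkeys(file_b))

in_both = [line for line in unique_b if line in set_a]
only_in_b = [line for line in unique_b if line not in set_a]
only_in_a = [line for line in file_a if line not in set_b]

print(in_both)
print(only_in_b)
print(only_in_a)
```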


Am I close?

If not, please explain with an example what you are trying
to do.


--
Steven


Re: [Tutor] 1 to N searches in files

2012-12-02 Thread Dave Angel
On 12/02/2012 03:53 AM, Spectral None wrote:
 Hi all

 I have two files (File A and File B) with strings of data in them (each 
 string on a separate line). Basically, each string in File B will be compared 
 with all the strings in File A and the resulting output is to show a list of 
 matched/unmatched lines and optionally to write to a third File C

 File A: Unique strings
 File B: Can have duplicate strings (that is, string1 may appear more than 
 once)

 My code currently looks like this:

 -
 FirstFile = open('C:\FileA.txt', 'r')
 SecondFile = open('C:\FileB.txt', 'r')
 ThirdFile = open('C:\FileC.txt', 'w')

 a = FirstFile.readlines()
 b = SecondFile.readlines()

 mydiff = difflib.Differ()
 results = mydiff(a,b)
 print('\n'.join(results))

 #ThirdFile.writelines(results)

 FirstFile.close()
 SecondFile.close()
 ThirdFile.close()
 -

 However, it seems that the results do not correctly reflect the 
 matched/unmatched lines. As an example, if FileA contains string1 and FileB 
 contains multiple occurrences of string1, it seems that the first 
 occurrence matches correctly but subsequent string1s are treated as 
 unmatched strings.

 I am thinking perhaps I don't understand Differ() that well and that it is 
 not doing what I hoped to do? Is Differ() comparing first line to first line 
 and second line to second line etc in contrast to what I wanted to do?

 Regards


Let me guess your goal, and then, on that assumption, discuss your code.


I think your File A is supposed to be a dictionary of valid words
(strings).  You want to process File B, checking each line against that
dictionary, and make a list of which lines are valid (in the
dictionary), and another of which lines are not (missing from the
dictionary).   That's one list for matched lines, and one for unmatched.

That isn't even close to what difflib does.  This can be solved with
minimal code, but not by starting with difflib.

What you should do is to loop through File A, adding all the lines to a
set called valid_dictionary.  Calling set(FirstFile) can do that in one
line, without even calling readlines().
Then a simple loop can build the desired lists.  matched_lines is
simply all the lines which are in the dictionary, while unmatched_lines
holds those which are not.

The heart of the comparison could simply look like:

   if line in valid_dictionary:
       matched_lines.append(line)
   else:
       unmatched_lines.append(line)
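
Put together, a complete version might look roughly like the sketch
below. The function name classify_lines and the throwaway demo files
are made up for illustration. The sketch also strips the trailing
newline from each line, which is an extra assumption on my part: it
keeps a final line with no newline from being reported as unmatched.

```python
import os
import tempfile


def classify_lines(path_a, path_b):
    """Return (matched_lines, unmatched_lines) for the lines of file B."""
    # Build the set of valid strings from File A, without the newlines.
    with open(path_a) as first_file:
        valid_dictionary = {line.rstrip("\n") for line in first_file}

    matched_lines = []
    unmatched_lines = []
    with open(path_b) as second_file:
        for line in second_file:
            line = line.rstrip("\n")
            if line in valid_dictionary:
                matched_lines.append(line)
            else:
                unmatched_lines.append(line)
    return matched_lines, unmatched_lines


# Small demonstration with throwaway files (names and contents are made up).
with tempfile.TemporaryDirectory() as folder:
    path_a = os.path.join(folder, "FileA.txt")
    path_b = os.path.join(folder, "FileB.txt")
    with open(path_a, "w") as f:
        f.write("string1\nstring2\n")
    with open(path_b, "w") as f:
        f.write("string1\nstring3\nstring1")  # no newline on the last line
    matched, unmatched = classify_lines(path_a, path_b)

print(matched)
print(unmatched)
```

Because File B is read line by line, duplicates in B are all kept, in
their original order.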


-- 

DaveA
