Re: [Tutor] Expanding a Python script to include a zcat and awk pre-process

2010-01-09 Thread galaxywatcher
After many more hours of reading and testing, I am still struggling to
finish this simple script, even though, bear in mind, I already got my
desired results by preprocessing with an awk one-liner.


I am opening a zipped file properly, so I did make some progress, but  
simply assigning num1 and num2 to the first 2 columns of the file  
remains elusive. Num3 here gets assigned, not to the 3rd column, but  
the rest of the entire file. I feel like I am missing a simple strip()  
or some other incantation that prevents the entire file from getting  
blobbed into num3. Any help is appreciated in advance.


#!/usr/bin/env python

import string
import re
import zipfile
highflag = flagcount = sum = sumtotal = 0
f = file("test.zip")
z = zipfile.ZipFile(f)
for f in z.namelist():
    ranges = z.read(f)
    ranges = ranges.strip()
    num1, num2, num3 = re.split('\W+', ranges, 2)  ## This line is the root of the problem.

    sum = int(num2) - int(num1)
    if sum > 1000:
        flag1 = " <"  # marker text was lost in the archive; " <" is a stand-in
        flagcount += 1
    else:
        flag1 = ""
    if sum > highflag:
        highflag = sum
    print str(num2) + " - " + str(num1) + " = " + str(sum) + flag1
    sumtotal = sumtotal + sum

print "Total ranges = ", sumtotal
print "Total ranges over 10 million: ", flagcount
print "Largest range: ", highflag

==
$ zcat test.zip
134873600, 134873855, "32787 Protex Technologies, Inc."
135338240, 135338495, 40597
135338496, 135338751, 40993
201720832, 201721087, "12838 HFF Infrastructure & Operations"
202739456, 202739711, "1623 Beseau Regional de la Region Languedoc Roussillon"


___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] Expanding a Python script to include a zcat and awk pre-process

2010-01-09 Thread Dave Angel

galaxywatc...@gmail.com wrote:
> After many more hours of reading and testing, I am still struggling to
> finish this simple script, which bear in mind, I already got my
> desired results by preprocessing with an awk one-liner.
>
> I am opening a zipped file properly, so I did make some progress, but
> simply assigning num1 and num2 to the first 2 columns of the file
> remains elusive. Num3 here gets assigned, not to the 3rd column, but
> the rest of the entire file. I feel like I am missing a simple strip()
> or some other incantation that prevents the entire file from getting
> blobbed into num3. Any help is appreciated in advance.
>
> #!/usr/bin/env python
>
> import string
> import re
> import zipfile
> highflag = flagcount = sum = sumtotal = 0
> f = file("test.zip")
> z = zipfile.ZipFile(f)
> for f in z.namelist():
>     ranges = z.read(f)

This reads the whole file into ranges.  In your earlier incantation, you
looped over the file, one line at a time.  So to do the equivalent, you
want to do a split here, and one more nesting of loops.

    lines = z.read(f).split("\n")   # build a list of text lines
    for ranges in lines:            # here, ranges is a single line

and of course, indent the remainder.
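
A minimal sketch of that extra level of nesting, written for Python 3 rather than the Python 2 used in the thread. The archive is built in memory so the example runs on its own; the member name `ranges.txt` and the helper `iter_member_lines` are made up for illustration.

```python
import io
import zipfile


def iter_member_lines(archive):
    """Yield (member_name, line) for every text line of every member."""
    for name in archive.namelist():
        # read() returns the whole member as bytes, so decode it first
        text = archive.read(name).decode("utf-8")
        # inner loop: one iteration per line instead of one per file
        for line in text.splitlines():
            yield name, line.strip()


# Build a small zip archive in memory (hypothetical sample data).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr(
        "ranges.txt",
        "134873600, 134873855, 32787\n135338240, 135338495, 40597\n",
    )

with zipfile.ZipFile(buffer) as zf:
    lines = list(iter_member_lines(zf))

for name, line in lines:
    print(name, line)
```

With the per-line loop in place, a three-way split only ever sees one row at a time, so the third value can no longer swallow the rest of the file.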





Re: [Tutor] Expanding a Python script to include a zcat and awk pre-process

2010-01-09 Thread galaxywatcher
I finally got it working! I would do a victory lap around my apartment  
building if I wasn't recovering from a broken ankle.


Excuse my excitement, but this simple script marks a new level of  
Python proficiency for me. Thanks to Kent, Bob, Denis, and others who  
pointed me in the right direction.
It does quite a few things: decompresses a zipped file or files if
there is an archive of them, processes a rather ugly csv file (ugly
because it uses a comma as a delimiter, yet there are commas inside
double-quoted fields), and does a simple subtraction of the two
columns with a little summary to give me the data I need.


#!/usr/bin/env python
import string
import re
import zipfile
highflag = flagcount = sum = sumtotal = 0
z = zipfile.ZipFile('textfile.zip')
for subfile in z.namelist():
    print "Working on filename: " + subfile + "\n"
    data = z.read(subfile)
    pat = re.compile(r"(\d+), (\d+), (\".+\"|\w+)")
    for line in data.splitlines():
        result = pat.match(line)
        ranges = result.groups()
        num1 = ranges[0]
        num2 = ranges[1]
        sum = int(num2) - int(num1)
        if sum > 1000:
            flag1 = " <"  # marker text was lost in the archive; " <" is a stand-in
            flagcount += 1
        else:
            flag1 = ""
        if sum > highflag:
            highflag = sum
        print str(num2) + " - " + str(num1) + " = " + str(sum) + flag1
        sumtotal = sumtotal + sum

print "Total ranges = ", sumtotal
print "Total ranges over 10 million: ", flagcount
print "Largest range: ", highflag

A few observations from a Python newbie: The zipfile and gzip modules  
should really be merged together. gzcat on unix reads both compression  
formats. It took me way too long to figure out the namelist() method.  
But I did learn a lot more about how zip actually works as a result.  
Is there really no way to extract the contents of a single zipped file  
without using a 'for in namelist' construct?
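
On that last question: as far as I know, `ZipFile.read()` accepts a member name directly, so the loop is only needed when the name is not known ahead of time. A small Python 3 sketch, assuming an in-memory archive and a made-up member name `textfile.txt`:

```python
import io
import zipfile

# Build a one-file archive in memory (hypothetical sample data).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("textfile.txt", "135338240, 135338495, 40597\n")

with zipfile.ZipFile(buffer) as zf:
    # Read one member by name, no loop needed.
    by_name = zf.read("textfile.txt").decode("utf-8")
    # If the name is unknown but the archive holds a single file,
    # take the first entry of namelist() instead.
    first = zf.read(zf.namelist()[0]).decode("utf-8")

print(by_name)
```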


Trying to get split() to extract just two columns from my data was a  
dead end. The re module is the way to go.


I feel like I am in relatively new territory with Python's regex
engine. Denis did save me some valuable time with his regex, but my
file had values in the 3rd column that started with alphas as opposed
to numerics only, and flipping that (\".+\"|\d+) to a (\".+\"|\w+)
had me gnashing teeth and pulling hair the whole way through
the regex tutorial. When I finally figured it out, I smacked my forehead
and said "of course!". The compile() method of Python's regex engine is
new for me. Makes sense. Just something I have to get used to. I do
have the feeling that Perl's regex is better. But that is another story.



Re: [Tutor] Expanding a Python script to include a zcat and awk pre-process

2010-01-08 Thread spir
galaxywatc...@gmail.com dixit:

> I wrote a simple Python script to process a text file, but I had to
> run a shell one liner to get the text file primed for the script. I
> would much rather have the Python script handle the whole task without
> any pre-processing at all. I will show 1) a small sample of the text
> file, 2) my script, 3) the one liner that I want to fold into the
> script, and 4) the task at hand.
>
> 1) $ zcat textfile.txt.zip | head -5
> 134873600, 134873855, "32787 Protex Technologies, Inc."
> 135338240, 135338495, 40597
> 135338496, 135338751, 40993
> 201720832, 201721087, "12838 HFF Infrastructure & Operations"
> 202739456, 202739711, "1623 Beseau Regional de la Region Languedoc Roussillon"
>
> 2) $ cat getranges.py
> #!/usr/bin/env python
>
> import string
>
> highflag = flagcount = sum = sumtotal = 0
> infile = open("textfile.txt", "r")
> # Find the range by subtracting column 1 from column 2
> for line in infile:
>     num1, num2 = string.split(line)
>     sum = int(num2) - int(num1)
>     if sum > 1000:
>         flag1 = " <"  # marker text was lost in the archive; " <" is a stand-in
>         flagcount += 1
>         if sum > highflag:
>             highflag = sum
>     else:
>         flag1 = ""
>     print str(num2) + " - " + str(num1) + " = " + str(sum) + flag1
>     sumtotal = sumtotal + sum
> print "Total ranges = ", sumtotal
> print "Total # of ranges over 10 million: ", flagcount
> print "Largest range: ", highflag
>
> 3) zcat textfile.txt.zip | awk -F, '{print $1, $2}' > textfile.txt
>
> 4) In my first iteration, I used string.split(num1, ",") but I ran
> into trouble when I encountered commas within column 3, such as "32787
> Protexic Technologies, Inc.". I don't know how to handle this
> exception. I also don't know how to uncompress the file in Python and
> pass it to the rest of the script. Hence I used my zcat | awk oneliner
> to get the job done. So how do I uncompress zip and gzipped files in
> Python, and how do I force split to only evaluate the first two
> columns? Better yet, can I tell split to not evaluate commas in the
> double quoted 3rd column?
>
> Regards,
> Blake

There are several possibilities:

1) The choice of ',' as separator for data that can contain commas is, ahem,
not very clever ;-)
Can you change that, so as to solve the issue at its source? (eg: any text 
processor allows converting a table to plain text using whatever separator). 
CSV is not a panacea...

2) Preprocess data to replace commas _outside quotes_ by a better chosen sep, 
such as TAB
(eg read data while keeping an in_quotes flag).

3) Use a more powerful text processing tool, such as regexes:

data = '''\
134873600, 134873855, "32787 Protex Technologies, Inc."
135338240, 135338495, 40597
135338496, 135338751, 40993
201720832, 201721087, "12838 HFF Infrastructure & Operations"
202739456, 202739711, "1623 Beseau Regional de la Region Languedoc Roussillon"'''
import re
pat = re.compile(r"(\d+), (\d+), (\".+\"|\d+)")
for line in data.splitlines():
    result = pat.match(line)
    print result.groups()
==
('134873600', '134873855', '"32787 Protex Technologies, Inc."')
('135338240', '135338495', '40597')
('135338496', '135338751', '40993')
('201720832', '201721087', '"12838 HFF Infrastructure & Operations"')
('202739456', '202739711', '"1623 Beseau Regional de la Region Languedoc Roussillon"')
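
Option 2 (replacing the commas outside quotes while keeping an in_quotes flag) could be sketched roughly as below. This is Python 3, it assumes the third column is double-quoted as described in the question, and the function name `commas_to_tabs` is invented for the example.

```python
def commas_to_tabs(line):
    """Replace commas outside double quotes with TAB; keep quoted commas."""
    out = []
    in_quotes = False
    for ch in line:
        if ch == '"':
            # toggle the flag on every double quote, and keep the quote
            in_quotes = not in_quotes
            out.append(ch)
        elif ch == "," and not in_quotes:
            out.append("\t")
        else:
            out.append(ch)
    return "".join(out)


sample = '134873600, 134873855, "32787 Protex Technologies, Inc."'
converted = commas_to_tabs(sample)
# After the conversion a plain split on TAB is safe.
fields = [field.strip() for field in converted.split("\t")]
print(fields)
```

The sketch does not handle escaped quotes inside a field; for data like that the csv module is probably the safer route.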

Denis



la vita e estrany

http://spir.wikidot.com/


Re: [Tutor] Expanding a Python script to include a zcat and awk pre-process

2010-01-08 Thread Kent Johnson
On Fri, Jan 8, 2010 at 2:24 AM, <galaxywatc...@gmail.com> wrote:

> So how do I
> uncompress zip and gzipped files in Python,

zipfile and gzip
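
A rough Python 3 illustration of the difference between the two modules, using in-memory buffers so it runs without any files on disk. The sample text and the member name `textfile.txt` are assumptions for the example.

```python
import gzip
import io
import zipfile

text = "135338240, 135338495, 40597\n"

# gzip: a single compressed stream, opened much like a regular file.
gz_buffer = io.BytesIO()
with gzip.GzipFile(fileobj=gz_buffer, mode="wb") as gz:
    gz.write(text.encode("utf-8"))
gz_buffer.seek(0)
with gzip.open(gz_buffer, "rt", encoding="utf-8") as gz:
    from_gzip = gz.read()

# zip: an archive of named members, each read by its name.
zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, "w") as zf:
    zf.writestr("textfile.txt", text)
with zipfile.ZipFile(zip_buffer) as zf:
    from_zip = zf.read("textfile.txt").decode("utf-8")

print(from_gzip == from_zip)
```

With real files, the buffers would simply be replaced by file names passed to `gzip.open()` and `zipfile.ZipFile()`.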

> and how do I force split to only
> evaluate the first two columns?

Use the optional second argument to split():
line.split(',', 2)
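
For instance, on one of the sample rows (Python 3; the row is assumed to carry the double quotes mentioned in the question):

```python
line = '134873600, 134873855, "32787 Protex Technologies, Inc."'

# Split on at most two commas: the third piece keeps its own commas.
num1, num2, rest = line.split(",", 2)
span = int(num2) - int(num1)  # int() tolerates the leading space
print(span, rest.strip())
```

Note that the three-way unpacking assumes every row has at least three columns; a row with fewer would raise a ValueError.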

> Better yet, can I tell split to not evaluate
> commas in the double quoted 3rd column?

Use the csv module.
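
A short Python 3 sketch of what that might look like for this data. It assumes the rows use a comma followed by a space as the separator, which is why `skipinitialspace=True` is passed; the sample rows are taken from the question.

```python
import csv
import io

data = (
    '134873600, 134873855, "32787 Protex Technologies, Inc."\n'
    "135338240, 135338495, 40597\n"
)

# skipinitialspace=True drops the blank after each comma, so the
# opening double quote is seen at the start of the field.
reader = csv.reader(io.StringIO(data), skipinitialspace=True)
rows = list(reader)

# Column 2 minus column 1, as in the original script.
spans = [int(row[1]) - int(row[0]) for row in rows]
print(rows)
print(spans)
```

The reader strips the surrounding quotes and leaves the comma inside the company name alone, so each row comes back with exactly three fields.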

Kent