Re: [Tutor] Expanding a Python script to include a zcat and awk pre-process

2010-01-09 Thread galaxywatcher
I finally got it working! I would do a victory lap around my apartment
building if I weren't recovering from a broken ankle.


Excuse my excitement, but this simple script marks a new level of  
Python proficiency for me. Thanks to Kent, Bob, Denis, and others who  
pointed me in the right direction.
It does quite a few things: it decompresses a zip archive holding one or
more files, processes a rather ugly csv file (ugly because it uses a
comma as a delimiter, yet some double-quoted fields contain commas),
subtracts the first column from the second, and prints a short summary
to give me the data I need.


#!/usr/bin/env python
import string
import re
import zipfile
highflag = flagcount = sum = sumtotal = 0
z = zipfile.ZipFile('textfile.zip')
for subfile in z.namelist():
    print "Working on filename: " + subfile + "\n"
    data = z.read(subfile)
    pat = re.compile(r"""(\d+), (\d+), (\".+\"|\w+)""")
    for line in data.splitlines():
        result = pat.match(line)
        ranges = result.groups()
        num1 = ranges[0]
        num2 = ranges[1]
        sum = int(num2) - int(num1)
        if sum > 1000:
            flag1 = " "
            flagcount += 1
        else:
            flag1 = ""
        if sum > highflag:
            highflag = sum
        print str(num2) + " - " + str(num1) + " = " + str(sum) + flag1
        sumtotal = sumtotal + sum

print "Total ranges = ", sumtotal
print "Total ranges over 10 million: ", flagcount
print "Largest range: ", highflag

A few observations from a Python newbie: The zipfile and gzip modules  
should really be merged together. gzcat on unix reads both compression  
formats. It took me way too long to figure out the namelist() method.  
But I did learn a lot more about how zip actually works as a result.  
Is there really no way to extract the contents of a single zipped file  
without using a 'for in namelist' construct?
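
On that last question, a minimal sketch (Python 3, whereas the scripts in this thread are Python 2). ZipFile.read() takes a member name directly, so the loop is only needed when the name is not known in advance. The archive is built in memory and the member name ranges.csv is made up for the demonstration.

```python
import io
import zipfile

# Build a small one-member archive in memory (hypothetical name and contents).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("ranges.csv", "135338240, 135338495, 40597\n")

with zipfile.ZipFile(buf) as z:
    # If the member name is known, read it directly.
    by_name = z.read("ranges.csv")
    # If the archive holds exactly one file, take the first listed name.
    first = z.read(z.namelist()[0])

print(by_name.decode("ascii"))
```

Both calls return the same bytes here; namelist()[0] is only a safe shortcut when the archive is known to hold a single file.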


Trying to get split() to extract just two columns from my data was a  
dead end. The re module is the way to go.


I feel like I am in relatively new territory with Python's regex  
engine. Denis did save me some valuable time with his regex, but my  
file had values in the 3rd column that started with alphas as opposed
to numerics only, and flipping that (\".+\"|\d+)""") to a
(\".+\"|\w+)""") had me gnashing teeth and pulling hair the whole way
through the regex tutorial. When I finally figured it out, I smacked my
forehead and said "of course!". The compile() method of Python's regex
engine is
new for me. Makes sense. Just something I have to get used to. I do  
have the feeling that Perl's regex is better. But that is another story.
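
The difference between the two patterns can be seen in a small sketch (Python 3). The sample line is hypothetical: the thread does not show a row whose third column starts with letters, so AS40597 is an invented value.

```python
import re

# Third group as originally suggested: quoted text or digits only.
digits_only = re.compile(r'(\d+), (\d+), (".+"|\d+)')
# Third group after the change: quoted text or any run of word characters.
word_chars = re.compile(r'(\d+), (\d+), (".+"|\w+)')

# Hypothetical line whose third column starts with letters.
line = "135338240, 135338495, AS40597"

print(digits_only.match(line))          # no match: \d+ cannot start at "A"
print(word_chars.match(line).groups())  # all three columns captured
```

Because match() returns None when the pattern fails, calling .groups() on the result of the digits-only pattern would raise an AttributeError on such a line.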

___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] Expanding a Python script to include a zcat and awk pre-process

2010-01-09 Thread Dave Angel

galaxywatc...@gmail.com wrote:
> After many more hours of reading and testing, I am still struggling to
> finish this simple script, which bear in mind, I already got my
> desired results by preprocessing with an awk one-liner.
>
> I am opening a zipped file properly, so I did make some progress, but
> simply assigning num1 and num2 to the first 2 columns of the file
> remains elusive. Num3 here gets assigned, not to the 3rd column, but
> the rest of the entire file. I feel like I am missing a simple strip()
> or some other incantation that prevents the entire file from getting
> blobbed into num3. Any help is appreciated in advance.
>
> #!/usr/bin/env python
>
> import string
> import re
> import zipfile
> highflag = flagcount = sum = sumtotal = 0
> f = file("test.zip")
> z = zipfile.ZipFile(f)
> for f in z.namelist():
>     ranges = z.read(f)

This reads the whole file into ranges.  In your earlier incantation, you
looped over the file, one line at a time.  So to do the equivalent, you
want to do a split here, and one more nesting of loops.

    lines = z.read(f).split("\n")  # build a list of text lines
    for ranges in lines:           # here, ranges is a single line

and of course, indent the remainder.

>     ranges = ranges.strip()
>     num1, num2, num3 = re.split('\W+', ranges, 2)  ## This line is the root of the problem.
>
>     sum = int(num2) - int(num1)
>     if sum > 1000:
>         flag1 = " "
>         flagcount += 1
>     else:
>         flag1 = ""
>     if sum > highflag:
>         highflag = sum
>     print str(num2) + " - " + str(num1) + " = " + str(sum) + flag1
>     sumtotal = sumtotal + sum
>
> print "Total ranges = ", sumtotal
> print "Total ranges over 10 million: ", flagcount
> print "Largest range: ", highflag
>
> ==
> $ zcat test.zip
> 134873600, 134873855, "32787 Protex Technologies, Inc."
> 135338240, 135338495, 40597
> 135338496, 135338751, 40993
> 201720832, 201721087, "12838 HFF Infrastructure & Operations"
> 202739456, 202739711, "1623 Beseau Regional de la Region Languedoc Roussillon"
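
The extra level of looping suggested above might look like the following sketch (Python 3). The string blob stands in for the decoded result of z.read(f) and holds two of the sample rows; the blank-line check is an added assumption, not part of the original advice.

```python
import re

# Stand-in for the text z.read(f) would return (decoded, two sample rows).
blob = (
    '134873600, 134873855, "32787 Protex Technologies, Inc."\n'
    '135338240, 135338495, 40597\n'
)

differences = []
for line in blob.splitlines():          # one record at a time
    line = line.strip()
    if not line:                        # skip blank lines
        continue
    # With a single line as input, the third piece is just the rest of that line.
    num1, num2, rest = re.split(r'\W+', line, maxsplit=2)
    differences.append(int(num2) - int(num1))

print(differences)
```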








Re: [Tutor] Expanding a Python script to include a zcat and awk pre-process

2010-01-09 Thread galaxywatcher
After many more hours of reading and testing, I am still struggling to  
finish this simple script, which bear in mind, I already got my  
desired results by preprocessing with an awk one-liner.


I am opening a zipped file properly, so I did make some progress, but  
simply assigning num1 and num2 to the first 2 columns of the file  
remains elusive. Num3 here gets assigned, not to the 3rd column, but  
the rest of the entire file. I feel like I am missing a simple strip()  
or some other incantation that prevents the entire file from getting  
blobbed into num3. Any help is appreciated in advance.
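
The symptom described above can be reproduced in a few lines (a sketch in Python 3, using two of the sample rows as a stand-in for the archive contents). With a split limit of 2, everything after the second separator, newlines included, lands in the third name.

```python
import re

# Two sample rows as one string, the way a whole-file read returns them.
blob = "135338240, 135338495, 40597\n135338496, 135338751, 40993\n"

# Only two splits are performed, so num3 receives the remainder of the file.
num1, num2, num3 = re.split(r'\W+', blob.strip(), maxsplit=2)
print(repr(num3))
```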


#!/usr/bin/env python

import string
import re
import zipfile
highflag = flagcount = sum = sumtotal = 0
f = file("test.zip")
z = zipfile.ZipFile(f)
for f in z.namelist():
    ranges = z.read(f)
    ranges = ranges.strip()
    num1, num2, num3 = re.split('\W+', ranges, 2)  ## This line is the root of the problem.

    sum = int(num2) - int(num1)
    if sum > 1000:
        flag1 = " "
        flagcount += 1
    else:
        flag1 = ""
    if sum > highflag:
        highflag = sum
    print str(num2) + " - " + str(num1) + " = " + str(sum) + flag1
    sumtotal = sumtotal + sum

print "Total ranges = ", sumtotal
print "Total ranges over 10 million: ", flagcount
print "Largest range: ", highflag

==
$ zcat test.zip
134873600, 134873855, "32787 Protex Technologies, Inc."
135338240, 135338495, 40597
135338496, 135338751, 40993
201720832, 201721087, "12838 HFF Infrastructure & Operations"
202739456, 202739711, "1623 Beseau Regional de la Region Languedoc Roussillon"




Re: [Tutor] Expanding a Python script to include a zcat and awk pre-process

2010-01-08 Thread Kent Johnson
On Fri, Jan 8, 2010 at 2:24 AM,   wrote:

> So how do I
> uncompress zip and gzipped files in Python,

zipfile and gzip
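
For the gzip half of that answer, a minimal sketch (Python 3). The data is compressed in memory so nothing needs to exist on disk; with a real file, the path would be passed to gzip.open() instead of the BytesIO object.

```python
import gzip
import io

text = "135338240, 135338495, 40597\n"

# Compress in memory so the sketch needs no file on disk.
compressed = gzip.compress(text.encode("ascii"))

# gzip.open accepts a file object as well as a path; "rt" yields text.
with gzip.open(io.BytesIO(compressed), "rt", encoding="ascii") as fh:
    lines = fh.read().splitlines()

print(lines)
```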

> and how do I force split to only
> evaluate the first two columns?

Use the optional second argument to split():
line.split(',', 2)
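
A short sketch of that call on one of the sample rows (Python 3):

```python
line = '134873600, 134873855, "32787 Protex Technologies, Inc."'

# maxsplit=2 stops after the second comma, so the quoted field stays whole.
num1, num2, rest = line.split(',', 2)

# int() tolerates the leading space left behind after each comma.
difference = int(num2) - int(num1)
print(difference)
print(rest.strip())
```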

> Better yet, can I tell split to not evaluate
> commas in the double quoted 3rd column?

Use the csv module.
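
A minimal sketch of the csv approach (Python 3), fed two of the sample rows from a string. The skipinitialspace option is an assumption about what this data needs: the rows have a blank after each comma, and without the option the reader would not treat the double quote as the start of a quoted field.

```python
import csv
import io

sample = (
    '134873600, 134873855, "32787 Protex Technologies, Inc."\n'
    '135338240, 135338495, 40597\n'
)

# skipinitialspace=True drops the blank after each comma, which also lets
# the reader recognise the opening double quote of the third field.
rows = list(csv.reader(io.StringIO(sample), skipinitialspace=True))

for num1, num2, name in rows:
    print(int(num2) - int(num1), name)
```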

Kent


Re: [Tutor] Expanding a Python script to include a zcat and awk pre-process

2010-01-08 Thread spir
galaxywatc...@gmail.com dixit:

> I wrote a simple Python script to process a text file, but I had to  
> run a shell one liner to get the text file primed for the script. I  
> would much rather have the Python script handle the whole task without  
> any pre-processing at all. I will show 1) a small sample of the text  
> file, 2) my script, 3) the one liner that I want to fold into the  
> script, and 4) the task at hand.
> 
> 1) $ zcat textfile.txt.zip | head -5
> 134873600, 134873855, "32787 Protex Technologies, Inc."
> 135338240, 135338495, 40597
> 135338496, 135338751, 40993
> 201720832, 201721087, "12838 HFF Infrastructure & Operations"
> 202739456, 202739711, "1623 Beseau Regional de la Region Languedoc Roussillon"
> 
> 
> 2) $ cat getranges.py
> #!/usr/bin/env python
> 
> import string
> 
> highflag = flagcount = sum = sumtotal = 0
> infile = open("textfile.txt", "r")
> # Find the range by subtracting column 1 from column 2
> for line in infile:
>     num1, num2 = string.split(line)
>     sum = int(num2) - int(num1)
>     if sum > 1000:
>         flag1 = " "
>         flagcount += 1
>         if sum > highflag:
>             highflag = sum
>     else:
>         flag1 = ""
>     print str(num2) + " - " + str(num1) + " = " + str(sum) + flag1
>     sumtotal = sumtotal + sum
> print "Total ranges = ", sumtotal
> print "Total # of ranges over 10 million: ", flagcount
> print "Largest range: ", highflag
> 
> 3) zcat textfile.txt.zip | awk -F"," '{print $1, $2}' > textfile.txt
> 
> 4) In my first iteration, I used string.split(num1, ",") but I ran  
> into trouble when I encountered commas within column 3, such as "32787
> Protex Technologies, Inc.". I don't know how to handle this
> exception. I also don't know how to uncompress the file in Python and
> pass it to the rest of the script. Hence I used my zcat | awk one-liner
> to get the job done. So how do I uncompress zip and gzipped files in  
> Python, and how do I force split to only evaluate the first two  
> columns? Better yet, can I tell split to not evaluate commas in the  
> double quoted 3rd column?
> 
> Regards,
> Blake

There are several possibilities:

1) The choice of ',' as separator for data that can contain commas is, hem,
not very clever ;-)
Can you change that, so as to solve the issue at its source? (e.g. any text
processor allows converting a table to plain text using whatever separator.)
CSV is not a panacea...

2) Preprocess the data to replace commas _outside quotes_ with a better-chosen
separator, such as TAB (e.g. read the data while keeping an "in_quotes" flag).

3) Use a more powerful text processing tool, such as regexes:

data = '''\
134873600, 134873855, "32787 Protex Technologies, Inc."
135338240, 135338495, 40597
135338496, 135338751, 40993
201720832, 201721087, "12838 HFF Infrastructure & Operations"
202739456, 202739711, "1623 Beseau Regional de la Region Languedoc Roussillon"'''
import re
pat = re.compile(r"""(\d+), (\d+), (\".+\"|\d+)""")
for line in data.splitlines():
    result = pat.match(line)
    print result.groups()
==>
('134873600', '134873855', '"32787 Protex Technologies, Inc."')
('135338240', '135338495', '40597')
('135338496', '135338751', '40993')
('201720832', '201721087', '"12838 HFF Infrastructure & Operations"')
('202739456', '202739711', '"1623 Beseau Regional de la Region Languedoc Roussillon"')
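
Option 2 above, the "in_quotes" flag, could be sketched as follows (Python 3). The function name commas_to_tabs is made up for illustration, and the sketch assumes the data never contains escaped or doubled quotes inside a quoted field.

```python
def commas_to_tabs(line):
    """Replace commas outside double quotes with TAB characters."""
    out = []
    in_quotes = False
    for ch in line:
        if ch == '"':
            in_quotes = not in_quotes   # toggle on every double quote
            out.append(ch)
        elif ch == ',' and not in_quotes:
            out.append('\t')            # separator comma becomes a TAB
        else:
            out.append(ch)
    return ''.join(out)

line = '134873600, 134873855, "32787 Protex Technologies, Inc."'
fields = [f.strip() for f in commas_to_tabs(line).split('\t')]
print(fields)
```

After the conversion a plain split on TAB is enough, since the comma inside the quoted company name is left untouched.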

Denis



la vita e estrany

http://spir.wikidot.com/


[Tutor] Expanding a Python script to include a zcat and awk pre-process

2010-01-07 Thread galaxywatcher
I wrote a simple Python script to process a text file, but I had to  
run a shell one liner to get the text file primed for the script. I  
would much rather have the Python script handle the whole task without  
any pre-processing at all. I will show 1) a small sample of the text  
file, 2) my script, 3) the one liner that I want to fold into the  
script, and 4) the task at hand.


1) $ zcat textfile.txt.zip | head -5
134873600, 134873855, "32787 Protex Technologies, Inc."
135338240, 135338495, 40597
135338496, 135338751, 40993
201720832, 201721087, "12838 HFF Infrastructure & Operations"
202739456, 202739711, "1623 Beseau Regional de la Region Languedoc Roussillon"



2) $ cat getranges.py
#!/usr/bin/env python

import string

highflag = flagcount = sum = sumtotal = 0
infile = open("textfile.txt", "r")
# Find the range by subtracting column 1 from column 2
for line in infile:
    num1, num2 = string.split(line)
    sum = int(num2) - int(num1)
    if sum > 1000:
        flag1 = " "
        flagcount += 1
        if sum > highflag:
            highflag = sum
    else:
        flag1 = ""
    print str(num2) + " - " + str(num1) + " = " + str(sum) + flag1
    sumtotal = sumtotal + sum
print "Total ranges = ", sumtotal
print "Total # of ranges over 10 million: ", flagcount
print "Largest range: ", highflag

3) zcat textfile.txt.zip | awk -F"," '{print $1, $2}' > textfile.txt

4) In my first iteration, I used string.split(num1, ",") but I ran  
into trouble when I encountered commas within column 3, such as "32787
Protex Technologies, Inc.". I don't know how to handle this
exception. I also don't know how to uncompress the file in Python and
pass it to the rest of the script. Hence I used my zcat | awk one-liner
to get the job done. So how do I uncompress zip and gzipped files in  
Python, and how do I force split to only evaluate the first two  
columns? Better yet, can I tell split to not evaluate commas in the  
double quoted 3rd column?
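
One way the whole task might be folded into a single script is sketched below (Python 3, standard library only). This is an illustration under assumptions, not the approach the thread settled on: the function name summarize and the threshold default are invented, the archive is built in memory for the demonstration, and the rows are assumed to be UTF-8 text with a blank after each comma.

```python
import csv
import io
import zipfile

def summarize(zip_file, threshold=1000):
    """Return (total, count_over_threshold, largest) across all members."""
    total = over = largest = 0
    with zipfile.ZipFile(zip_file) as z:
        for name in z.namelist():
            with z.open(name) as raw:
                # Decode the member on the fly so csv can read it as text.
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                for row in csv.reader(text, skipinitialspace=True):
                    if len(row) < 2:          # skip blank or short lines
                        continue
                    diff = int(row[1]) - int(row[0])
                    total += diff
                    over += diff > threshold  # True counts as 1
                    largest = max(largest, diff)
    return total, over, largest

# Demonstration archive built in memory with two of the sample rows.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("textfile.txt",
                '134873600, 134873855, "32787 Protex Technologies, Inc."\n'
                '135338240, 135338495, 40597\n')

result = summarize(buf)
print(result)
```

With a file on disk, the path to the archive would be passed to summarize() in place of the BytesIO object.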


Regards,
Blake