I finally got it working! I would do a victory lap around my apartment
building if I weren't recovering from a broken ankle.
Excuse my excitement, but this simple script marks a new level of
Python proficiency for me. Thanks to Kent, Bob, Denis, and others who
pointed me in the right direction.
It does quite a few things: it decompresses a zipped file (or several,
if the archive holds more than one), processes a rather ugly csv file
(ugly because it uses a comma as a delimiter, yet there are commas
inside double-quoted fields), and subtracts one column from the other,
with a little summary to give me the data I need.
#!/usr/bin/env python
import re
import zipfile

highflag = flagcount = sumtotal = 0

# Two integer columns, then a third that is either quoted or a bare word.
pat = re.compile(r"""(\d+), (\d+), (\".+\"|\w+)""")

z = zipfile.ZipFile('textfile.zip')
for subfile in z.namelist():
    print "Working on filename: " + subfile + "\n"
    data = z.read(subfile)
    for line in data.splitlines():
        result = pat.match(line)
        if result is None:
            continue  # skip lines that do not match the pattern
        ranges = result.groups()
        num1 = ranges[0]
        num2 = ranges[1]
        # 'diff' rather than 'sum', so the built-in sum() is not shadowed
        diff = int(num2) - int(num1)
        if diff > 10000000:
            flag1 = " !!!!"
            flagcount += 1
        else:
            flag1 = ""
        if diff > highflag:
            highflag = diff
        print str(num2) + " - " + str(num1) + " = " + str(diff) + flag1
        sumtotal = sumtotal + diff

print "Total ranges = ", sumtotal
print "Total ranges over 10 million: ", flagcount
print "Largest range: ", highflag
A few observations from a Python newbie: The zipfile and gzip modules
should really be merged together. gzcat on unix reads both compression
formats. It took me way too long to figure out the namelist() method.
But I did learn a lot more about how zip actually works as a result.
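On the zipfile/gzip point, the two modules can at least be hidden behind one function. This is only a rough sketch: read_compressed is a made-up helper name, and it assumes that any file which is not a zip archive must be gzip data.

```python
import gzip
import zipfile


def read_compressed(path):
    """Return {member_name: bytes} for a .zip or .gz file.

    Hypothetical helper, not part of the standard library.
    """
    if zipfile.is_zipfile(path):
        z = zipfile.ZipFile(path)
        try:
            return dict((name, z.read(name)) for name in z.namelist())
        finally:
            z.close()
    # Assumption: anything that is not a zip archive is gzip data.
    # A gzip file holds a single stream, so the path serves as its name.
    f = gzip.open(path, "rb")
    try:
        return {path: f.read()}
    finally:
        f.close()
```

Returning a dictionary keeps the two cases uniform: a gzip file simply looks like an archive with one member.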
Is there really no way to extract the contents of a single zipped file
without using a 'for in namelist' construct?
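As far as I can tell, the loop is not strictly required: ZipFile.read() takes a member name directly, and namelist()[0] gives the first entry when the name is not known in advance. A minimal sketch, where read_member and the file name ranges.csv are invented for illustration:

```python
import io
import zipfile


def read_member(archive, name=None):
    """Return the bytes of one archive member.

    Hypothetical helper: if `name` is omitted, the first entry
    reported by namelist() is used.
    """
    z = zipfile.ZipFile(archive)
    try:
        if name is None:
            name = z.namelist()[0]
        return z.read(name)
    finally:
        z.close()


# Build a small in-memory archive so the sketch is self-contained.
buf = io.BytesIO()
zf = zipfile.ZipFile(buf, "w")
zf.writestr("ranges.csv", "100, 250, abc\n")
zf.close()

data = read_member(buf, "ranges.csv")
```

With a real file, read_member('textfile.zip') would return the contents of the first entry without any explicit loop.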
Trying to get split() to extract just two columns from my data was a
dead end. The re module is the way to go.
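For comparison, the standard csv module might also cope with commas inside quoted fields, which is exactly where a plain split(',') goes wrong. This is an alternative to the regex, not what the script above does, and the sample lines are invented; it assumes the real data uses a comma followed by a space as the separator.

```python
import csv

# Invented sample lines in the same shape as the data described above.
sample_lines = [
    '100, 250, abc123',
    '5000, 20005000, "Smith, John"',
]


def first_two_columns(lines):
    """Return (num1, num2) integer pairs from comma-separated lines."""
    rows = []
    # skipinitialspace=True drops the space after each comma, so the
    # opening double quote is recognised as the start of a quoted field.
    for fields in csv.reader(lines, skipinitialspace=True):
        rows.append((int(fields[0]), int(fields[1])))
    return rows


pairs = first_two_columns(sample_lines)
```

The reader accepts any iterable of lines, so the output of data.splitlines() could be passed in the same way.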
I feel like I am in relatively new territory with Python's regex
engine. Denis did save me some valuable time with his regex, but my
file had values in the 3rd column that started with letters rather than
digits only, and changing that (\".+\"|\d+)""") to (\".+\"|\w+)""")
had me gnashing my teeth and pulling my hair the whole way through
the regex tutorial. When I finally figured it out, I smacked my forehead
and said "of course!". The compile() function of Python's re module is
new for me. Makes sense. Just something I have to get used to. I do
have the feeling that Perl's regex is better. But that is another story.
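To make the \d+ versus \w+ difference concrete, here is a small sketch with both patterns compiled once and tried against two invented sample lines. The variable names are mine, and the sample values are assumptions about what the data looks like.

```python
import re

# The earlier pattern: third column must be quoted or all digits.
digits_only = re.compile(r"""(\d+), (\d+), (\".+\"|\d+)""")
# The working pattern: third column may be quoted or any word characters.
word_chars = re.compile(r"""(\d+), (\d+), (\".+\"|\w+)""")

alpha_line = '100, 250, abc123'                  # invented sample
quoted_line = '5000, 20005000, "Smith, John"'    # invented sample

# \d+ cannot match a value that starts with a letter, so match() gives None.
no_match = digits_only.match(alpha_line)

# \w+ accepts letters, digits and underscores, so the same line now matches.
groups = word_chars.match(alpha_line).groups()
difference = int(groups[1]) - int(groups[0])

# Quoted fields are handled by the first alternative in both patterns.
quoted_groups = word_chars.match(quoted_line).groups()
```

Compiling once and reusing the pattern object also avoids rebuilding the pattern for every file in the archive.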
_______________________________________________
Tutor maillist - Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor