I finally got it working! I would do a victory lap around my apartment building if I weren't recovering from a broken ankle.

Excuse my excitement, but this simple script marks a new level of Python proficiency for me. Thanks to Kent, Bob, Denis, and others who pointed me in the right direction. It does quite a few things: it decompresses a zipped file (or several, if the archive holds more than one), processes a rather ugly CSV file (ugly because it uses a comma as the delimiter, yet some of the double-quoted fields contain commas), and subtracts one column from the other, with a little summary to give me the data I need.

#!/usr/bin/env python
import re
import zipfile

# Running totals: largest range seen, number of flagged rows, grand total.
highflag = flagcount = sumtotal = 0

# Two numeric columns, then either a double-quoted field or a bare word.
pat = re.compile(r"""(\d+), (\d+), (\".+\"|\w+)""")

z = zipfile.ZipFile('textfile.zip')
for subfile in z.namelist():
    print "Working on filename: " + subfile + "\n"
    data = z.read(subfile)
    for line in data.splitlines():
        result = pat.match(line)
        if result is None:
            # Skip blank lines and anything else that is not a data row.
            continue
        num1, num2 = result.groups()[:2]
        # Size of the range; named diff so the built-in sum() stays usable.
        diff = int(num2) - int(num1)
        if diff > 10000000:
            flag1 = " !!!!"
            flagcount += 1
        else:
            flag1 = ""
        if diff > highflag:
            highflag = diff
        print num2 + " - " + num1 + " = " + str(diff) + flag1
        sumtotal += diff

print "Total ranges = ", sumtotal
print "Total ranges over 10 million: ", flagcount
print "Largest range: ", highflag

A few observations from a Python newbie: the zipfile and gzip modules should really be merged together, since gzcat on Unix reads both compression formats. It took me way too long to figure out the namelist() method, but I did learn a lot more about how zip actually works as a result. Is there really no way to extract the contents of a single zipped file without using a 'for in namelist' construct?
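On the namelist() question, here is a minimal sketch of two ways around the loop, written for Python 3 rather than the Python 2 of the script above. ZipFile.read() accepts a member name directly, so the loop is only needed when the name is not known in advance. The helper names read_member and read_only_member and the member name ranges.csv are made up for illustration.

```python
import io
import zipfile

def read_member(archive, name):
    """Return the raw bytes of one named member of a zip archive."""
    with zipfile.ZipFile(archive) as z:
        return z.read(name)

def read_only_member(archive):
    """Return the bytes of the sole member; refuse archives with more or fewer."""
    with zipfile.ZipFile(archive) as z:
        names = z.namelist()
        if len(names) != 1:
            raise ValueError("expected exactly one member, found %d" % len(names))
        return z.read(names[0])

# Build a tiny in-memory archive so the sketch runs without any file on disk.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("ranges.csv", "100, 250, abc\n")

print(read_member(buf, "ranges.csv"))
print(read_only_member(buf))
```

Both helpers return bytes under Python 3, so the result would need a .decode() before being matched against a text regex.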

Trying to get split() to extract just two columns from my data was a dead end. The re module is the way to go.
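To illustrate the difference, here is a small sketch in Python 3 using a made-up sample row (the real data is not shown in this post). A plain split on commas cuts the quoted field in two, while the pattern from the script keeps it whole. The helper name parse_row is hypothetical.

```python
import re

# Made-up sample row: the third field is quoted and contains a comma.
line = '1000, 2500, "Smith, John"'

# Naive approach: the comma inside the quotes is treated as a delimiter too,
# so the row comes back as four pieces instead of three.
naive = line.split(",")

# Same pattern as in the script: two numbers, then a quoted field or a bare word.
PATTERN = re.compile(r'(\d+), (\d+), (".+"|\w+)')

def parse_row(text):
    """Return the three captured columns, or None if the row does not match."""
    m = PATTERN.match(text)
    return m.groups() if m else None

print(naive)
print(parse_row(line))
```

Returning None for non-matching rows lets the caller skip headers or blank lines instead of crashing on them.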

I feel like I am in relatively new territory with Python's regex engine. Denis did save me some valuable time with his regex, but my file had values in the third column that started with letters rather than being purely numeric, and changing (\".+\"|\d+)""") to (\".+\"|\w+)""") had me gnashing my teeth and pulling my hair the whole way through the regex tutorial. When I finally figured it out, I smacked my forehead and said "of course!". The compile() method of Python's re module is new to me. It makes sense; it is just something I have to get used to. I do have the feeling that Perl's regex is better, but that is another story.
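A short Python 3 sketch of why the change mattered, using invented rows: \d+ accepts digits only, so a third column that starts with a letter makes the whole match fail, while \w+ accepts letters, digits, and underscores. The names digits_only and word_chars are just labels for this example.

```python
import re

# compile() builds the pattern object once; match() can then be reused per line.
digits_only = re.compile(r'(\d+), (\d+), (".+"|\d+)')
word_chars = re.compile(r'(\d+), (\d+), (".+"|\w+)')

alpha_row = "100, 250, abc123"   # invented row: third column starts with letters
numeric_row = "100, 250, 42"     # invented row: third column is purely numeric

# The digits-only pattern rejects the alphabetic third column outright.
print(digits_only.match(alpha_row))

# The \w+ version accepts both kinds of row.
print(word_chars.match(alpha_row).groups())
print(word_chars.match(numeric_row).groups())
```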
_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor