Re: [Tutor] Expanding a Python script to include a zcat and awk pre-process
I finally got it working! I would do a victory lap around my apartment building if I weren't recovering from a broken ankle.

Excuse my excitement, but this simple script marks a new level of Python proficiency for me. Thanks to Kent, Bob, Denis, and others who pointed me in the right direction. It does quite a few things: it decompresses a zipped file (or files, if there is an archive of them), processes a rather ugly csv file (ugly because it uses a comma as a delimiter, yet there are commas inside double-quoted fields), and does a simple subtraction of the two columns with a little summary to give me the data I need.

#!/usr/bin/env python
import string
import re
import zipfile

highflag = flagcount = sum = sumtotal = 0
z = zipfile.ZipFile('textfile.zip')
for subfile in z.namelist():
    print "Working on filename: " + subfile + "\n"
    data = z.read(subfile)
    pat = re.compile(r"""(\d+), (\d+), (\".+\"|\w+)""")
    for line in data.splitlines():
        result = pat.match(line)
        ranges = result.groups()
        num1 = ranges[0]
        num2 = ranges[1]
        sum = int(num2) - int(num1)
        if sum > 1000:
            flag1 = " "
            flagcount += 1
        else:
            flag1 = ""
        if sum > highflag:
            highflag = sum
        print str(num2) + " - " + str(num1) + " = " + str(sum) + flag1
        sumtotal = sumtotal + sum
print "Total ranges = ", sumtotal
print "Total ranges over 10 million: ", flagcount
print "Largest range: ", highflag

A few observations from a Python newbie:

The zipfile and gzip modules should really be merged together. gzcat on unix reads both compression formats. It took me way too long to figure out the namelist() method. But I did learn a lot more about how zip actually works as a result. Is there really no way to extract the contents of a single zipped file without using a 'for in namelist' construct?

Trying to get split() to extract just two columns from my data was a dead end. The re module is the way to go.

I feel like I am in relatively new territory with Python's regex engine. Denis did save me some valuable time with his regex, but my file had values in the 3rd column that started with alphas as opposed to numerics only, and flipping that (\".+\"|\d+)""") to a (\".+\"|\w+)""") had me gnashing teeth and pulling hair the whole way through the regex tutorial. When I finally figured it out, I smacked my forehead and said "of course!".

The compile() method of Python's regex engine is new for me. Makes sense. Just something I have to get used to. I do have the feeling that Perl's regex is better. But that is another story.

___
Tutor maillist - Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
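On the question of reading a single zipped file without a 'for in namelist' loop: ZipFile.read() accepts a member name directly, so the loop is only needed when the name is not known in advance. The following is a minimal sketch in Python 3 (the thread itself uses Python 2). The helper name read_member and the in-memory archive are made up for illustration.

```python
import io
import zipfile


def read_member(archive, name=None):
    """Return the decoded text of one member of a zip archive.

    Hypothetical helper: if name is None, the archive must hold
    exactly one member, and that member is returned.
    """
    with zipfile.ZipFile(archive) as z:
        if name is None:
            names = z.namelist()
            if len(names) != 1:
                raise ValueError("expected one member, found %d" % len(names))
            name = names[0]
        # read() takes the member name directly; no loop is required.
        return z.read(name).decode("utf-8")


# Build a small archive in memory so the sketch runs without files on disk.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("textfile.txt", "135338240, 135338495, 40597\n")

text = read_member(buf, "textfile.txt")
print(text)
```

With a real file, the first argument would simply be the path, for example 'textfile.zip'.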
Re: [Tutor] Expanding a Python script to include a zcat and awk pre-process
galaxywatc...@gmail.com wrote:
> After many more hours of reading and testing, I am still struggling to
> finish this simple script, which bear in mind, I already got my desired
> results by preprocessing with an awk one-liner.
>
> I am opening a zipped file properly, so I did make some progress, but
> simply assigning num1 and num2 to the first 2 columns of the file
> remains elusive. Num3 here gets assigned, not to the 3rd column, but
> the rest of the entire file. I feel like I am missing a simple strip()
> or some other incantation that prevents the entire file from getting
> blobbed into num3. Any help is appreciated in advance.
>
> #!/usr/bin/env python
> import string
> import re
> import zipfile
>
> highflag = flagcount = sum = sumtotal = 0
> f = file("test.zip")
> z = zipfile.ZipFile(f)
> for f in z.namelist():
>     ranges = z.read(f)

This reads the whole file into ranges. In your earlier incantation, you looped over the file, one line at a time. So to do the equivalent, you want to do a split here, and one more nesting of loops.

    lines = z.read(f).split("\n")   # build a list of text lines
    for ranges in lines:            # here, ranges is a single line

and of course, indent the remainder.

>     ranges = ranges.strip()
>     num1, num2, num3 = re.split('\W+', ranges, 2)  ## This line is the root of the problem.
>     sum = int(num2) - int(num1)
>     if sum > 1000:
>         flag1 = " "
>         flagcount += 1
>     else:
>         flag1 = ""
>     if sum > highflag:
>         highflag = sum
>     print str(num2) + " - " + str(num1) + " = " + str(sum) + flag1
>     sumtotal = sumtotal + sum
> print "Total ranges = ", sumtotal
> print "Total ranges over 10 million: ", flagcount
> print "Largest range: ", highflag
>
> ==
> $ zcat test.zip
> 134873600, 134873855, "32787 Protex Technologies, Inc."
> 135338240, 135338495, 40597
> 135338496, 135338751, 40993
> 201720832, 201721087, "12838 HFF Infrastructure & Operations"
> 202739456, 202739711, "1623 Beseau Regional de la Region Languedoc Roussillon"
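The advice above (split the decompressed text into lines first, then split each line) can be sketched as follows. This is an illustration in Python 3, not the poster's final code. The function name first_two_columns is hypothetical, and the sample text stands in for what z.read(f) would return after decoding.

```python
# Sample text standing in for the decoded contents of one zip member.
data = (
    '134873600, 134873855, "32787 Protex Technologies, Inc."\n'
    '135338240, 135338495, 40597\n'
)


def first_two_columns(text):
    """Return (num1, num2) integer pairs, one per non-empty line."""
    rows = []
    for line in text.splitlines():      # one record per iteration
        line = line.strip()
        if not line:
            continue
        # maxsplit=2 stops after the second comma, so commas inside
        # the third column never produce extra fields.
        num1, num2, _rest = line.split(",", 2)
        rows.append((int(num1), int(num2)))  # int() ignores the leading space
    return rows


rows = first_two_columns(data)
print(rows)
```

Because the split happens per line, the third field holds only the remainder of that line rather than the remainder of the file.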
Re: [Tutor] Expanding a Python script to include a zcat and awk pre-process
After many more hours of reading and testing, I am still struggling to finish this simple script, which bear in mind, I already got my desired results by preprocessing with an awk one-liner.

I am opening a zipped file properly, so I did make some progress, but simply assigning num1 and num2 to the first 2 columns of the file remains elusive. Num3 here gets assigned, not to the 3rd column, but the rest of the entire file. I feel like I am missing a simple strip() or some other incantation that prevents the entire file from getting blobbed into num3. Any help is appreciated in advance.

#!/usr/bin/env python
import string
import re
import zipfile

highflag = flagcount = sum = sumtotal = 0
f = file("test.zip")
z = zipfile.ZipFile(f)
for f in z.namelist():
    ranges = z.read(f)
    ranges = ranges.strip()
    num1, num2, num3 = re.split('\W+', ranges, 2)  ## This line is the root of the problem.
    sum = int(num2) - int(num1)
    if sum > 1000:
        flag1 = " "
        flagcount += 1
    else:
        flag1 = ""
    if sum > highflag:
        highflag = sum
    print str(num2) + " - " + str(num1) + " = " + str(sum) + flag1
    sumtotal = sumtotal + sum
print "Total ranges = ", sumtotal
print "Total ranges over 10 million: ", flagcount
print "Largest range: ", highflag

==
$ zcat test.zip
134873600, 134873855, "32787 Protex Technologies, Inc."
135338240, 135338495, 40597
135338496, 135338751, 40993
201720832, 201721087, "12838 HFF Infrastructure & Operations"
202739456, 202739711, "1623 Beseau Regional de la Region Languedoc Roussillon"
Re: [Tutor] Expanding a Python script to include a zcat and awk pre-process
On Fri, Jan 8, 2010 at 2:24 AM, wrote:
> So how do I
> uncompress zip and gzipped files in Python,

zipfile and gzip

> and how do I force split to only
> evaluate the first two columns?

Use the optional second argument to split():
line.split(',', 2)

> Better yet, can I tell split to not evaluate
> commas in the double quoted 3rd column?

Use the csv module.

Kent
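As a hedged illustration of the csv suggestion, here is a small Python 3 sketch run against two of the sample lines from the thread. One detail is an assumption about this particular data rather than something stated in the reply: because each comma is followed by a space, the reader is given skipinitialspace=True so that the opening double quote is seen at the start of the field.

```python
import csv
import io

sample = (
    '134873600, 134873855, "32787 Protex Technologies, Inc."\n'
    '135338240, 135338495, 40597\n'
)

# skipinitialspace=True drops the blank after each delimiter; without it
# the quote is not at the start of the field and the quoted comma splits.
reader = csv.reader(io.StringIO(sample), skipinitialspace=True)
rows = [row for row in reader]

for num1, num2, name in rows:
    print(int(num2) - int(num1), name)
```

The quoted third column comes back as a single field with the quotes removed, so no regular expression is needed for this layout.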
Re: [Tutor] Expanding a Python script to include a zcat and awk pre-process
galaxywatc...@gmail.com wrote:
> I wrote a simple Python script to process a text file, but I had to
> run a shell one liner to get the text file primed for the script. I
> would much rather have the Python script handle the whole task without
> any pre-processing at all. I will show 1) a small sample of the text
> file, 2) my script, 3) the one liner that I want to fold into the
> script, and 4) the task at hand.
>
> 1) $ zcat textfile.txt.zip | head -5
> 134873600, 134873855, "32787 Protex Technologies, Inc."
> 135338240, 135338495, 40597
> 135338496, 135338751, 40993
> 201720832, 201721087, "12838 HFF Infrastructure & Operations"
> 202739456, 202739711, "1623 Beseau Regional de la Region Languedoc Roussillon"
>
>
> 2) $ cat getranges.py
> #!/usr/bin/env python
>
> import string
>
> highflag = flagcount = sum = sumtotal = 0
> infile = open("textfile.txt", "r")
> # Find the range by subtracting column 1 from column 2
> for line in infile:
>     num1, num2 = string.split(line)
>     sum = int(num2) - int(num1)
>     if sum > 1000:
>         flag1 = " "
>         flagcount += 1
>         if sum > highflag:
>             highflag = sum
>     else:
>         flag1 = ""
>     print str(num2) + " - " + str(num1) + " = " + str(sum) + flag1
>     sumtotal = sumtotal + sum
> print "Total ranges = ", sumtotal
> print "Total # of ranges over 10 million: ", flagcount
> print "Largest range: ", highflag
>
> 3) zcat textfile.txt.zip | awk -F"," '{print $1, $2}' > textfile.txt
>
> 4) In my first iteration, I used string.split(num1, ",") but I ran
> into trouble when I encountered commas within column 3, such as "32787
> Protexic Technologies, Inc.". I don't know how to handle this
> exception. I also don't know how to uncompress the file in Python and
> pass it to the rest of the script. Hence I used my zcat | awk oneliner
> to get the job done. So how do I uncompress zip and gzipped files in
> Python, and how do I force split to only evaluate the first two
> columns? Better yet, can I tell split to not evaluate commas in the
> double quoted 3rd column?
>
> Regards,
> Blake

There are several possibilities:

1) The choice of ',' as separator for data that can contain commas is, hem, not very clever ;-) Can you change that, so as to solve the issue at its source? (eg: any text processor allows converting a table to plain text using whatever separator). CSV is not a panacea...

2) Preprocess data to replace commas _outside quotes_ by a better chosen sep, such as TAB (eg read data while keeping an "in_quotes" flag).

3) Use a more powerful text processing tool, such as regexes:

data = '''\
134873600, 134873855, "32787 Protex Technologies, Inc."
135338240, 135338495, 40597
135338496, 135338751, 40993
201720832, 201721087, "12838 HFF Infrastructure & Operations"
202739456, 202739711, "1623 Beseau Regional de la Region Languedoc Roussillon"'''

import re
pat = re.compile(r"""(\d+), (\d+), (\".+\"|\d+)""")
for line in data.splitlines():
    result = pat.match(line)
    print result.groups()

==>
('134873600', '134873855', '"32787 Protex Technologies, Inc."')
('135338240', '135338495', '40597')
('135338496', '135338751', '40993')
('201720832', '201721087', '"12838 HFF Infrastructure & Operations"')
('202739456', '202739711', '"1623 Beseau Regional de la Region Languedoc Roussillon"')

Denis

la vita e estrany
http://spir.wikidot.com/
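Possibility 2 above (swap the commas outside quotes for a TAB while tracking an "in_quotes" flag) could look roughly like this in Python 3. The function name commas_to_tabs is invented for the sketch, and it assumes quotes are never escaped or doubled inside a field, which holds for the sample shown but may not hold for other data.

```python
def commas_to_tabs(line):
    """Replace commas outside double quotes with TAB characters.

    Hypothetical helper; assumes quotes are balanced and never escaped.
    """
    out = []
    in_quotes = False
    for ch in line:
        if ch == '"':
            in_quotes = not in_quotes   # entering or leaving a quoted field
            out.append(ch)
        elif ch == "," and not in_quotes:
            out.append("\t")            # a real separator
        else:
            out.append(ch)              # ordinary text, or a quoted comma
    return "".join(out)


line = '134873600, 134873855, "32787 Protex Technologies, Inc."'
fields = [f.strip() for f in commas_to_tabs(line).split("\t")]
print(fields)
```

After the conversion, a plain split on TAB is enough, at the cost of one extra pass over every line.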
[Tutor] Expanding a Python script to include a zcat and awk pre-process
I wrote a simple Python script to process a text file, but I had to run a shell one liner to get the text file primed for the script. I would much rather have the Python script handle the whole task without any pre-processing at all. I will show 1) a small sample of the text file, 2) my script, 3) the one liner that I want to fold into the script, and 4) the task at hand.

1) $ zcat textfile.txt.zip | head -5
134873600, 134873855, "32787 Protex Technologies, Inc."
135338240, 135338495, 40597
135338496, 135338751, 40993
201720832, 201721087, "12838 HFF Infrastructure & Operations"
202739456, 202739711, "1623 Beseau Regional de la Region Languedoc Roussillon"

2) $ cat getranges.py
#!/usr/bin/env python

import string

highflag = flagcount = sum = sumtotal = 0
infile = open("textfile.txt", "r")
# Find the range by subtracting column 1 from column 2
for line in infile:
    num1, num2 = string.split(line)
    sum = int(num2) - int(num1)
    if sum > 1000:
        flag1 = " "
        flagcount += 1
        if sum > highflag:
            highflag = sum
    else:
        flag1 = ""
    print str(num2) + " - " + str(num1) + " = " + str(sum) + flag1
    sumtotal = sumtotal + sum
print "Total ranges = ", sumtotal
print "Total # of ranges over 10 million: ", flagcount
print "Largest range: ", highflag

3) zcat textfile.txt.zip | awk -F"," '{print $1, $2}' > textfile.txt

4) In my first iteration, I used string.split(num1, ",") but I ran into trouble when I encountered commas within column 3, such as "32787 Protexic Technologies, Inc.". I don't know how to handle this exception. I also don't know how to uncompress the file in Python and pass it to the rest of the script. Hence I used my zcat | awk oneliner to get the job done. So how do I uncompress zip and gzipped files in Python, and how do I force split to only evaluate the first two columns? Better yet, can I tell split to not evaluate commas in the double quoted 3rd column?

Regards,
Blake
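For readers who want the zcat | awk step folded into the script, the following Python 3 sketch is one way it might be done with the zipfile and csv modules. It is an illustration under stated assumptions, not the script the poster ended up with: the function name summarize_ranges, the UTF-8 decoding, and the in-memory archive are all made up here, and the 1000 threshold is copied from the original code.

```python
import csv
import io
import zipfile


def summarize_ranges(archive, threshold=1000):
    """Return (total, count_over_threshold, largest) for every zip member."""
    total = flagged = largest = 0
    with zipfile.ZipFile(archive) as z:
        for name in z.namelist():
            # Assumption: the members are UTF-8 (or plain ASCII) text.
            text = z.read(name).decode("utf-8")
            # skipinitialspace lets csv honour the quoted third column.
            for row in csv.reader(text.splitlines(), skipinitialspace=True):
                if len(row) < 2:
                    continue                    # skip blank or short lines
                size = int(row[1]) - int(row[0])
                total += size
                if size > threshold:
                    flagged += 1
                largest = max(largest, size)
    return total, flagged, largest


# Demo archive built in memory from the sample lines in the message.
sample = (
    '134873600, 134873855, "32787 Protex Technologies, Inc."\n'
    '135338240, 135338495, 40597\n'
    '135338496, 135338751, 40993\n'
    '201720832, 201721087, "12838 HFF Infrastructure & Operations"\n'
    '202739456, 202739711, "1623 Beseau Regional de la Region Languedoc Roussillon"\n'
)
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("textfile.txt", sample)

print(summarize_ranges(buf))
```

A gzip-compressed file (as opposed to a zip archive) would need the gzip module instead, for example gzip.open(path, "rt"), with the same csv loop applied to the lines it yields.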