I'm writing my first 'real' program, i.e. one that has a purpose aside from serving as a learning exercise. I'm posting to solicit comments about my efforts at translating strings from an external source into useful data, regarding both efficiency and 'pythonicity'. My only significant programming experience is in PostScript, and I feel that I haven't yet 'found my feet' concerning the object-oriented aspects of Python, so I'd be especially interested to know where I may be neglecting to take advantage of them.
My input is in the form of correlated lists of strings, which I want to merge (while ignoring some extraneous items). I populate a dictionary called "found" with these data, still in string form. It contains sub-dictionaries of various items keyed to strings extracted from the list "names"; these sub-dictionaries in turn contain the associated items I want from "cells". After loading in the strings (I have omitted the statements that pick up strings that require no further processing, some of them coming from a third list), I convert selected items in place. Here's the function I wrote:

def extract_data():
    i = 0
    while i < len(names):
        name = names[i][6:] # strip off "Name: "
        found[name] = {'epoch1': cells[10 * i + na],
                       'epoch2': cells[10 * i + na + 1],
                       'time': cells[10 * i + na + 5],
                       'score1': cells[10 * i + na + 6],
                       'score2': cells[10 * i + na + 7]}
###
Following is my first parsing step, for those data that represent real numbers. The two obstacles I'm contending with here are that the figures have commas grouping the digits in threes, and that sometimes the data are non-numeric -- I'll deal with those later. Is there a more elegant way of removing the commas than the split-and-rejoin below?
###
        for k in ('time', 'score1', 'score2'):
            v = found[name][k]
            if v != "---" and v != "n/a": # skip non-numeric data
                v = ''.join(v.split(",")) # remove commas between 000s
                found[name][k] = float(v)
###
The next one is much messier. A couple of the strings represent times, which I think will be most useful in 'native' form, but the input is in the format "DD Mth YYYY HH:MM:SS UTC". Near the beginning of my program I have "from calendar import timegm". Before I can feed the data to this function, though, I have to convert the month abbreviation to a number.
I couldn't come up with anything more elegant than look-up from a list: the relevant part of my initialization is
'''
m_abbrevs = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
             "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
'''
I'm also rather unhappy with the way I kluged the seventh and eighth values in the tuple passed to timegm, the order of the date in the week and in the year respectively. (I would hate to have to calculate them.) The function doesn't seem to care what values I give it for these -- as long as I don't omit them -- so I guess they're only there for the sake of matching the output of the inverse function. Is there a version of timegm that takes a tuple of only six (or seven) elements, or any better way to handle this situation?
###
        for k in ('epoch1', 'epoch2'):
            dlist = found[name][k].split(" ")
            m = 0
            while m < 12:
                if m_abbrevs[m] == dlist[1]:
                    dlist[1] = m + 1
                    break
                m += 1
            tlist = dlist[3].split(":")
            found[name][k] = timegm((int(dlist[2]), int(dlist[1]),
                                     int(dlist[0]), int(tlist[0]),
                                     int(tlist[1]), int(tlist[2]),
                                     -1, -1, 0))
        i += 1

The function appears to be working OK as is, but I would welcome any & all suggestions for improving it or making it more idiomatic.

-- 
Odysseus
-- 
http://mail.python.org/mailman/listinfo/python-list
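
For comparison, here is a minimal sketch of how the two conversions asked about above (stripping the digit-grouping commas, and turning "DD Mth YYYY HH:MM:SS UTC" into a timestamp) might be condensed. It is not the code from the program itself: parse_number and parse_epoch are hypothetical helper names, and M_ABBREVS simply repeats the month table. Passing a six-element tuple to timegm relies on CPython's implementation reading only the first six fields, which appears to be the case but is not something the documentation promises.

```python
from calendar import timegm

# Same month table as in the post (renamed here for the sketch).
M_ABBREVS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
             "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def parse_number(text):
    """Return a float for strings like '1,234.5'.

    Non-numeric placeholders ('---' and 'n/a') are returned unchanged,
    mirroring the 'skip for now' behaviour described in the post.
    """
    if text in ("---", "n/a"):
        return text
    # str.replace removes the commas without the split-and-rejoin step.
    return float(text.replace(",", ""))


def parse_epoch(text):
    """Convert 'DD Mth YYYY HH:MM:SS UTC' to seconds since the Unix epoch."""
    day, month_abbrev, year, clock = text.split()[:4]
    hour, minute, second = (int(part) for part in clock.split(":"))
    # tuple.index replaces the explicit while-loop search; it raises
    # ValueError for an abbreviation that is not in the table.
    month = M_ABBREVS.index(month_abbrev) + 1
    # Assumption: timegm only looks at the first six fields, so the
    # weekday and day-of-year placeholders are left out entirely.
    return timegm((int(year), month, int(day), hour, minute, second))
```

With helpers like these, each loop body in extract_data would reduce to a single assignment such as found[name][k] = parse_epoch(found[name][k]).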