MooMaster wrote:
> So I'm reading in values from a file, and for each column I need to
> dynamically discover the range of possible values it can take and
> quantize if necessary. This is the solution I've come up with:
>
> def createInitialCluster(fileName):
>     #get the data from the file
>     points = []
>     with open(fileName, 'r') as f:
>         for line in f:
>             points.append(line.rstrip('\n'))
>     #clean up the data
>     fixedPoints = []
>     for point in points:
>         dimensions = [quantize(i, points, point.split(",").index(i))
>                       for i in point.split(",")]
>         print dimensions
>         fixedPoints.append(Point(dimensions))
>     #return an initial cluster of all the points
>     return Cluster(fixedPoints)
>
> def quantize(stringToQuantize, pointList, columnIndex):
>     #if it's numeric, no need to quantize
>     if(isNumeric(stringToQuantize)):
>         return float(stringToQuantize)
>     #first we need to discover all the possible values of this column
>     domain = []
>     for point in pointList:
>         domain.append(point.split(",")[columnIndex])
>     #make it a set to remove duplicates
>     domain = list(Set(domain))
>     #use the index into the domain as the number representing this value
>     return float(domain.index(stringToQuantize))
>
> #harvested from http://www.rosettacode.org/wiki/IsNumeric#Python
> def isNumeric(string):
>     try:
>         i = float(string)
>     except ValueError:
>         return False
>     return True
Big problem with this. I'm guessing you've never run it on a really
big file yet. Your quantize function does a lot of unnecessary work:
it rebuilds the list of possible column values for every single line;
for every single non-numeric entry, in fact. (Tech-speak: that is N^2
behavior in both the number of lines and the number of non-numeric
columns. Not good.) This will work ok for a small file, but will take
forever on a large file (perhaps literally).

So, forgetting about column indexing for a moment, we can improve this
vastly simply by generating each column's domain once. Do that in a
separate function, have it return the list, and then pass that list to
the quantize function. So replace the middle portion of your
createInitialCluster function with something like this:

....
    column_domains = []
    for i in xrange(len(points[0].split(","))):  # number of columns
        column_domains.append(quantize_column(points, i))
    fixedPoints = []
    for point in points:
        # need the column index as well as the entry here
        dimensions = [quantize(s, column_domains[i])
                      for (i, s) in enumerate(point.split(","))]
        print dimensions
        fixedPoints.append(Point(dimensions))
....

And the two functions would be something like this:

def quantize_column(point_list, column_index):
    # this assumes that if the column's first entry is numeric, the
    # whole column is numeric
    if isNumeric(point_list[0].split(",")[column_index]):
        return None  # don't quantize this column
    # first we need to discover all the possible values of this column
    domain = []
    for point in point_list:
        domain.append(point.split(",")[column_index])
    # make it a set to remove duplicates
    return list(set(domain))

def quantize(s, domain):
    if domain is None:  # numeric column, no quantization needed
        return float(s)
    return float(domain.index(s))

This (once debugged :) will run much, much faster on a large dataset.

Now back to your question.

> It works, but it feels a little ugly, and not exactly Pythonic. Using
> two lists I need the original point list to read in the data, then the
> dimensions one to hold the processed point, and a fixedPoint list to
> make objects out of the processed data. If my dataset is in the order
> of millions, this'll nuke the memory. I tried something like:
>
> for point in points:
>     point = Point([quantize(i, points, point.split(",").index(i))
>                    for i in point.split(",")])
>
> but when I print out the points afterward, it doesn't keep the
> changes.

It's because the order of items in a set is undefined. The order of
the items in list(set(["a","b","c","d"])) might be very different from
the order in list(set(["a","b","c","d","e"])). You were passing
quantize an incomplete list of points, so as the list of points grew,
the items in the set changed, and that messed up the ordering. In
fact, you should never rely on the order being the same, even if you
create the set with the very same arguments.

What you are trying to do should be done with dictionaries: create a
dict that maps each value to a number. Now, based on your quantize
function, it would seem that the number associated with a value is
arbitrary (it doesn't matter what it is, as long as it's distinct), so
it isn't necessary to read the whole csv file in before assigning
numbers; just build the dict as you go.

I suggest collections.defaultdict for this task. It's like a regular
dict, but it creates a new value any time you access a key that
doesn't exist, which is a perfect fit for your task. We'll pass it a
function that generates a different index each time it's called.
(This is probably too advanced for you, but oh well, it's such a cool
trick.)
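Here's the trick in isolation, as a quick Python 2 interactive sketch
(the string keys are just made-up examples):

>>> import collections, itertools
>>> quantization = collections.defaultdict(itertools.count().next)
>>> quantization["red"]
0
>>> quantization["blue"]
1
>>> quantization["red"]
0

Every lookup of a key that isn't in the dict yet calls the counter's
next method, so new values get numbered 0, 1, 2, ... in the order they
are first seen, while repeated values map back to the number already
assigned. Applied to your whole function, it looks like this: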
import collections
import itertools

def createInitialCluster(fileName):
    fixedPoints = []
    # quantization is a dict that assigns sequentially-increasing
    # numbers to values when reading keys that don't yet exist
    quantization = collections.defaultdict(itertools.count().next)
    with open(fileName, 'r') as f:
        for line in f:
            dimensions = []
            for s in line.rstrip('\n').split(","):
                if isNumeric(s):
                    dimensions.append(float(s))
                else:
                    dimensions.append(float(quantization[s]))
            fixedPoints.append(Point(dimensions))
    return Cluster(fixedPoints)

A couple of general pointers:

* Don't ever use i to represent a string; programmers expect i to be
an integer. In fact, i, j, k, l, m, and n should be integers, and u,
v, w, x, y, and z should be floats. You should stick to this
convention whenever possible, and definitely never use i for anything
but an integer.

* set is built into Python now; unless you're using an older version
(2.3 or earlier, I think) you should use the built-in set instead of
Set from the sets module.

* The Python style guide (PEP 8) recommends that variables use names
such as column_index rather than columnIndex. The world won't end if
you don't follow it, but if you want to be Pythonic, that's how.

Carl Banks