On Apr 6, 10:43 pm, Carl Banks <pavlovevide...@gmail.com> wrote: > MooMaster wrote: > > So I'm reading in values from a file, and for each column I need to > > dynamically discover the range of possible values it can take and > > quantize if necessary. This is the solution I've come up with: > > > <code> > > def createInitialCluster(fileName): > > #get the data from the file > > points = [] > > with open(fileName, 'r') as f: > > for line in f: > > points.append(line.rstrip('\n')) > > #clean up the data > > fixedPoints = [] > > for point in points: > > dimensions = [quantize(i, points, point.split(",").index(i)) > > for i in point.split(",")] > > print dimensions > > fixedPoints.append(Point(dimensions)) > > #return an initial cluster of all the points > > return Cluster(fixedPoints) > > > def quantize(stringToQuantize, pointList, columnIndex): > > #if it's numeric, no need to quantize > > if(isNumeric(stringToQuantize)): > > return float(stringToQuantize) > > #first we need to discover all the possible values of this column > > domain = [] > > for point in pointList: > > domain.append(point.split(",")[columnIndex]) > > #make it a set to remove duplicates > > domain = list(Set(domain)) > > #use the index into the domain as the number representing this > > value > > return float(domain.index(stringToQuantize)) > > > #harvested fromhttp://www.rosettacode.org/wiki/IsNumeric#Python > > def isNumeric(string): > > try: > > i = float(string) > > except ValueError: > > return False > > return True > > Big problem with this. I'm guessing you never ran it on a really big > file yet. Your quantize function does a lot of unnecessary work: it > rebuilds the list of indices for every single line, every single non- > numeric entry, in fact. (Tech-speak: that is N^2 behavior in both the > number of lines and number of non-numeric columns. Not good.) This > will work ok for a small file, but will take forever on a large file > (perhaps literally). > > So, forgetting about column indexing for a moment, we can improve this > vastly simply by generating the list of indices once. Do that in a > separete function, have it return the list, and then pass that list to > the quantize function. So replace the midportion of your > createInitialCluster function with something like this: > > .... > for i in xrange(len(points[0])): # number of columns > column_indices.append(quantize_column(points,i)) > fixedPoints = [] > for point in points: > dimensions = [quantize(s, column_indices[i], point.split > (",").index(i)) > for (i,s) in enumerate(point.split(","))] # need index > as well as entry here > print dimensions > fixedPoints.append(Point(dimensions)) > .... > > And the two functions would be something like this: > > def quantize_columns(point_list,column_index): > # this assumes if the first column is numeric the whole column > would be > if(isNumeric(point_list[0][column_index])): > return None # don't quantize this column > #first we need to discover all the possible values of this column > domain = [] > for point in point_list: > domain.append(point.split(",")[column_index]) > #make it a set to remove duplicates > return list(set(domain)) > > def quantize(i,domain,s): > if domain is None: > return float(s) > return float(domain.index(s)) > > This (once debugged :) will run much, much faster on a large dataset. > > Now back to your question. > > > It works, but it feels a little ugly, and not exactly Pythonic. Using > > two lists I need the original point list to read in the data, then the > > dimensions one to hold the processed point, and a fixedPoint list to > > make objects out of the processed data. If my dataset is in the order > > of millions, this'll nuke the memory. I tried something like: > > > for point in points: > > point = Point([quantize(i, points, point.split(",").index(i)) for i > > in point.split(",")]) > > but when I print out the points afterward, it doesn't keep the > > changes. > > It's because the order of items in a set is undefined. The order of > the items in list(set(["a","b","c","d"])) might be very different from > the order in list(set(["a","b","c","d","e"])). You were passing > quantize incomplete an incomplete list of points, so as the points > grew, the items in the set changed, and it messed up the order. In > fact, you should never rely on the order being the same, even if you > created the set with the very same arguments. > > What you are trying to do should be done with dictionaries: create a > dict that maps a value to a number. > > Now, based on your quantize function, it would seem that the number > associated with the value is arbitrary (doesn't matter what it is, as > long as it's distinct), so it isn't necessary to read the whole csv > file in before assigning numbers; just build the dict as you go. > > I suggest collections.defaultdict for this task. It's like a regular > dict, but it creates a new value any time you access a key that > doesn't exist, a perfect solution to your task. We'll pass it a > function that generates a different index each time it's called. > (This is probably too advanced for you, but oh well, it's such a cool > trick.) > > import collections > import itertools > > def createInitialCluster(fileName): > fixedPoints = [] > # quantization is a dict that assigns sequentially-increasing > numbers > # to values when reading keys that don't yet exit > quantization = defaultdict.collections(itertools.count().next) > with open(fileName, 'r') as f: > for line in f: > dimensions = [] > for s in line.rstrip('\n').split(","): > if isNumeric(s): > dimensions.append(float(s)) > else: > dimensions.append(float(quantization[s])) > fixedPoints.append(Point(dimensions)) > return Cluster(fixedPoints) > > A couple general pointers: > > * Don't ever use i to represent a string. Programmers expect i to be > an integer. i,j,k,l,m, and n should be integers, in fact. u,v,w,x,y, > and z should be floats. You should stick to this convention whenever > possible, but definitely never use i for anything but an integer. > > * set is built into Python now; unless you're using an older version > (2.3 I think) you should use set instead of Set. > > * The Python style guide (q.g.) recommends that variables use names > such as column_index rather than columnIndex. The world won't end if > you don't follow it but if you want to be Pythonic that's how. > > Carl Banks
Nice explanation! Very detailed and thorough, thanks! I haven't used defaultdict before, so that was a very useful and efficient trick! -- http://mail.python.org/mailman/listinfo/python-list