Re: in place list modification necessary? What's a better idiom?
On Apr 7, 12:40 am, Peter Otten <__pete...@web.de> wrote:
> Peter Otten wrote:
> > MooMaster wrote:
>
> >> Now we can't calculate a meaningful Euclidean distance for something
> >> like "Iris-setosa" and "Iris-versicolor" unless we use string-edit
> >> distance or something overly complicated, so instead we'll use a
> >> simple quantization scheme of enumerating the set of values within the
> >> column domain and replacing the strings with numbers (i.e. Iris-setosa
> >> = 1, iris-versicolor=2).
>
> > I'd calculate the distance as
>
> > def string_dist(x, y, weight=1):
> >     return weight * (x == y)
>
> oops, this must of course be (x != y).
>
> > You don't get a high resolution in that dimension, but you don't
> > introduce an element of randomness, either.
>
> > Peter

The randomness doesn't matter too much: all K-means cares about is the
distance between two points in a coordinate space, and as long as that
space is invariant it doesn't matter much (i.e. we don't want (1,1)
becoming (3,1) on the next iteration, or the value for a quantized
column changing). With that in mind, I was hoping to be lazy and just go
with an enumeration approach... Nevertheless, it does introduce a subtle
ordering for nominal data: if Iris-setosa=1, Iris-versicolor=2, and
Iris-virginica=3, then on that scale Iris-versicolor is intuitively
"closer" to virginica than setosa is, when in fact such distances don't
mean anything on a nominal scale. I hadn't thought about a function like
that, but it makes a lot of sense. Thanks!
--
http://mail.python.org/mailman/listinfo/python-list
Re: in place list modification necessary? What's a better idiom?
On Apr 6, 10:43 pm, Carl Banks wrote:
> MooMaster wrote:
> > So I'm reading in values from a file, and for each column I need to
> > dynamically discover the range of possible values it can take and
> > quantize if necessary. This is the solution I've come up with:
> >
> > def createInitialCluster(fileName):
> >     #get the data from the file
> >     points = []
> >     with open(fileName, 'r') as f:
> >         for line in f:
> >             points.append(line.rstrip('\n'))
> >     #clean up the data
> >     fixedPoints = []
> >     for point in points:
> >         dimensions = [quantize(i, points, point.split(",").index(i))
> >                       for i in point.split(",")]
> >         print dimensions
> >         fixedPoints.append(Point(dimensions))
> >     #return an initial cluster of all the points
> >     return Cluster(fixedPoints)
> >
> > def quantize(stringToQuantize, pointList, columnIndex):
> >     #if it's numeric, no need to quantize
> >     if(isNumeric(stringToQuantize)):
> >         return float(stringToQuantize)
> >     #first we need to discover all the possible values of this column
> >     domain = []
> >     for point in pointList:
> >         domain.append(point.split(",")[columnIndex])
> >     #make it a set to remove duplicates
> >     domain = list(Set(domain))
> >     #use the index into the domain as the number representing this value
> >     return float(domain.index(stringToQuantize))
> >
> > #harvested from http://www.rosettacode.org/wiki/IsNumeric#Python
> > def isNumeric(string):
> >     try:
> >         i = float(string)
> >     except ValueError:
> >         return False
> >     return True
>
> Big problem with this. I'm guessing you never ran it on a really big
> file yet. Your quantize function does a lot of unnecessary work: it
> rebuilds the list of indices for every single line, every single
> non-numeric entry, in fact. (Tech-speak: that is N^2 behavior in both
> the number of lines and number of non-numeric columns. Not good.) This
> will work ok for a small file, but will take forever on a large file
> (perhaps literally).
>
> So, forgetting about column indexing for a moment, we can improve this
> vastly simply by generating the list of indices once. Do that in a
> separate function, have it return the list, and then pass that list to
> the quantize function. So replace the midportion of your
> createInitialCluster function with something like this:
>
>     column_indices = []
>     for i in xrange(len(points[0].split(","))):  # number of columns
>         column_indices.append(quantize_columns(points, i))
>     fixedPoints = []
>     for point in points:
>         # need the index as well as the entry here
>         dimensions = [quantize(i, column_indices[i], s)
>                       for (i, s) in enumerate(point.split(","))]
>         print dimensions
>         fixedPoints.append(Point(dimensions))
>
> And the two functions would be something like this:
>
> def quantize_columns(point_list, column_index):
>     # this assumes that if the first entry in a column is numeric,
>     # the whole column would be
>     if isNumeric(point_list[0].split(",")[column_index]):
>         return None  # don't quantize this column
>     #first we need to discover all the possible values of this column
>     domain = []
>     for point in point_list:
>         domain.append(point.split(",")[column_index])
>     #make it a set to remove duplicates
>     return list(set(domain))
>
> def quantize(i, domain, s):
>     if domain is None:
>         return float(s)
>     return float(domain.index(s))
>
> This (once debugged :) will run much, much faster on a large dataset.
>
> Now back to your question.
>
> > It works, but it feels a little ugly, and not exactly Pythonic. Using
> > two lists I need the original point list to read in the data, then the
> > dimensions one to hold the processed point, and a fixedPoint list to
> > make objects out of the processed data. If my dataset is in the order
> > of millions, this'll nuke the memory. I tried something like:
> >
> > for point in points:
> >     point = Point([quantize(i, points, point.split(",").index(i))
> >                    for i in point.split(",")])
> >
> > but when I print out the points afterward, it doesn't keep the
> > changes.
>
> It's because the order of items in a set is undefined. The order of
> the items in list(set(["a","b","c","d"])) might be very different from
> the order in list(set(["a","b","c","d","e"])). You were passing
> quantize an incomplete list of points, so as the points grew, the
> items in the set changed, and it messed up the order. In fact, you
> should never rely on the order being the same, even if you created the
> set with the very same arguments.
>
> What you are trying to do should be done with dictionaries: create a
> dict that maps a value to a number.
>
> Now, based on your quantize function, it would seem that the number
> associated with the value is arbitrary (doesn't matter what it is, as
> long as it's distinct), so it isn't necessary to read the whole csv
> file in before assigning numbers; just build the dict as you go.
Re: in place list modification necessary? What's a better idiom?
MRAB wrote:
> andrew cooke wrote:
>> R. David Murray wrote:
>>> [...]
>>>> try:
>>>>     dimensions.append(float(s))
>>>> except:
>>>>     dimensions.append(float(quantization[s]))
>>>
>>> No, no, no; never use a bare except! :)
>>
>> can you explain why? i can't think of any reason why the code would be
>> better catching a specific exception.
>>
>> as a general rule, maybe, but in this particular case i can't see a
>> reason - so i'm not sure if you're just pedantically following rules
>> or if i've missed something i should know.
>
> A bare except will catch _any_ exception, even those you didn't
> expect (technical term: "bug" :-)); for example, if you had written:
>
>     try:
>         dimension.append(float(s))
>     except:
>         dimensions.append(float(quantization[s]))
>
> it would silently catch "NameError: name 'dimension' is not defined"
> and perform the fallback action.

true, i hadn't considered coding errors (i was thinking that something
global, like out of memory, would fail again anyway).

andrew
Re: in place list modification necessary? What's a better idiom?
On Tue, 7 Apr 2009 at 09:01, andrew cooke wrote:
> R. David Murray wrote:
>> [...]
>>> try:
>>>     dimensions.append(float(s))
>>> except:
>>>     dimensions.append(float(quantization[s]))
>>
>> No, no, no; never use a bare except! :)
>
> can you explain why? i can't think of any reason why the code would be
> better catching a specific exception.
>
> as a general rule, maybe, but in this particular case i can't see a
> reason - so i'm not sure if you're just pedantically following rules
> or if i've missed something i should know.

What if the user pressed ctl-c right when the float was being converted
and appended? Never use a bare except unless you have a specific reason
to do so, and there are very few of those. (Yes, I should have said it
that way to start with, my apologies for going hyperbolic.)

Using it because it doesn't _look_ like it will cause issues is just
asking for hard-to-track-down bugs :).

--
R. David Murray           http://www.bitdance.com
Re: in place list modification necessary? What's a better idiom?
andrew cooke wrote:
> R. David Murray wrote:
>> [...]
>>> try:
>>>     dimensions.append(float(s))
>>> except:
>>>     dimensions.append(float(quantization[s]))
>>
>> No, no, no; never use a bare except! :)
>
> can you explain why? i can't think of any reason why the code would be
> better catching a specific exception.
>
> as a general rule, maybe, but in this particular case i can't see a
> reason - so i'm not sure if you're just pedantically following rules
> or if i've missed something i should know.

A bare except will catch _any_ exception, even those you didn't expect
(technical term: "bug" :-)); for example, if you had written:

    try:
        dimension.append(float(s))
    except:
        dimensions.append(float(quantization[s]))

it would silently catch "NameError: name 'dimension' is not defined" and
perform the fallback action.
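[Editor's note: MRAB's point can be reproduced in a few lines. This is a
minimal sketch, not thread code: the `dimension`/`dimensions` typo and the
`quantization` dict are borrowed from the example above, and the helper
function names are invented for illustration.]

```python
quantization = {"red": 0.0}

def parse_bare(s, dimensions):
    # Buggy: 'dimension' is a typo for 'dimensions'.
    try:
        dimension.append(float(s))
    except:                     # bare except swallows the NameError too
        dimensions.append(quantization[s])

def parse_narrow(s, dimensions):
    # Same typo, but only ValueError is caught.
    try:
        dimension.append(float(s))
    except ValueError:          # the NameError escapes and is visible
        dimensions.append(quantization[s])

dims = []
parse_bare("red", dims)     # typo silently masked; falls back as if
print(dims)                 # float() had failed -> prints [0.0]

try:
    parse_narrow("red", dims)
except NameError as e:
    print("bug exposed:", e)
```

The bare-except version "works" only by accident; the narrow version
surfaces the coding error immediately, which is R. David Murray's and
MRAB's argument.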
Re: in place list modification necessary? What's a better idiom?
R. David Murray wrote:
>> [...]
>> try:
>>     dimensions.append(float(s))
>> except:
>>     dimensions.append(float(quantization[s]))
>
> No, no, no; never use a bare except! :)

can you explain why? i can't think of any reason why the code would be
better catching a specific exception.

as a general rule, maybe, but in this particular case i can't see a
reason - so i'm not sure if you're just pedantically following rules or
if i've missed something i should know.

andrew
Re: in place list modification necessary? What's a better idiom?
just playing around - doesn't work with 3.0 due to lack of pattern
binding (which i think is coming back in 3.1?)

>>> from collections import defaultdict
>>> from itertools import count
>>> def mkdict(): return defaultdict(count().next)
...
>>> converters = defaultdict(mkdict)
>>> def to_float((i, s)):
...     try: return float(s)
...     except: return converters[i][s]
...
>>> def convert(line):
...     return map(to_float, zip(count(), line.split(',')))
...
>>> convert('1,2,red')
[1.0, 2.0, 0]
>>> convert('1,2,red')
[1.0, 2.0, 0]
>>> convert('1,2,blue')
[1.0, 2.0, 1]
>>> convert('1,2,blue,blue')
[1.0, 2.0, 1, 0]

andrew cooke wrote:
> Carl Banks wrote:
>> import collections
>> import itertools
>>
>> def createInitialCluster(fileName):
>>     fixedPoints = []
>>     # quantization is a dict that assigns sequentially-increasing
>>     # numbers to values when reading keys that don't yet exist
>>     quantization = collections.defaultdict(itertools.count().next)
>>     with open(fileName, 'r') as f:
>>         for line in f:
>>             dimensions = []
>>             for s in line.rstrip('\n').split(","):
>>                 if isNumeric(s):
>>                     dimensions.append(float(s))
>>                 else:
>>                     dimensions.append(float(quantization[s]))
>>             fixedPoints.append(Point(dimensions))
>>     return Cluster(fixedPoints)
>
> nice reply (i didn't know defaultdict worked like that - very neat).
>
> two small things i noticed:
>
> 1 - do you need a separate quantization for each column? the code above
> might give, for example, non-contiguous ranges of integers for a
> particular column if a string occurs ("by accident" perhaps) in more
> than one.
>
> 2 - don't bother with isNumeric. just return the cast value or catch
> the exception:
>
> [...]
> try:
>     dimensions.append(float(s))
> except:
>     dimensions.append(float(quantization[s]))
>
> (not sure float() is needed there either if you're using a recent
> version of python - only reason i can think of is to avoid integer
> division in older versions).

andrew
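[Editor's note: andrew's session relies on two Python 2-isms - the
`count().next` bound method and tuple parameter unpacking in
`def to_float((i, s))`. A rough Python 3 equivalent, my adaptation
rather than thread code, keeps his per-column numbering:]

```python
from collections import defaultdict
from itertools import count

# One auto-numbering dict per column, so codes stay contiguous
# within each column (andrew's point 1 above).
converters = defaultdict(lambda: defaultdict(count().__next__))

def to_float(i, s):
    try:
        return float(s)
    except ValueError:
        return converters[i][s]

def convert(line):
    return [to_float(i, s) for i, s in enumerate(line.split(','))]

print(convert('1,2,red'))        # [1.0, 2.0, 0]
print(convert('1,2,blue'))       # [1.0, 2.0, 1]
print(convert('1,2,blue,blue'))  # [1.0, 2.0, 1, 0]
```

Each new inner defaultdict gets a fresh `count()`, so the first unseen
string in any column is coded 0, the next 1, and so on.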
Re: in place list modification necessary? What's a better idiom?
"andrew cooke" wrote: > Carl Banks wrote: > > import collections > > import itertools > > > > def createInitialCluster(fileName): > > fixedPoints = [] > > # quantization is a dict that assigns sequentially-increasing > > numbers > > # to values when reading keys that don't yet exit > > quantization = defaultdict.collections(itertools.count().next) > > with open(fileName, 'r') as f: > > for line in f: > > dimensions = [] > > for s in line.rstrip('\n').split(","): > > if isNumeric(s): > > dimensions.append(float(s)) > > else: > > dimensions.append(float(quantization[s])) > > fixedPoints.append(Point(dimensions)) > > return Cluster(fixedPoints) > > nice reply (i didn't know defaultdict worked like that - very neat). > > two small things i noticed: > > 1 - do you need a separate quantization for each column? the code above > might give, for example, non-contiguous ranges of integers for a > particular column if a string occurs ("by accident" perhaps) in more than > one. > > 2 - don't bother with isNumeric. just return the cast value or catch the > exception: > > [...] > try: > dimensions.append(float(s)) > except: > dimensions.append(float(quantization[s])) No, no, no; never use a bare except! :) Do it MRAB's way and catch ValueError. -- R. David Murray http://www.bitdance.com -- http://mail.python.org/mailman/listinfo/python-list
Re: in place list modification necessary? What's a better idiom?
Carl Banks wrote:
> MooMaster wrote:
>> So I'm reading in values from a file, and for each column I need to
>> dynamically discover the range of possible values it can take and
>> quantize if necessary. This is the solution I've come up with:
>>
>> [snip]
>>
>> #harvested from http://www.rosettacode.org/wiki/IsNumeric#Python
>> def isNumeric(string):
>>     try:
>>         i = float(string)
>>     except ValueError:
>>         return False
>>     return True
>
> [snip]
>
> import collections
> import itertools
>
> def createInitialCluster(fileName):
>     fixedPoints = []
>     # quantization is a dict that assigns sequentially-increasing
>     # numbers to values when reading keys that don't yet exist
>     quantization = collections.defaultdict(itertools.count().next)
>     with open(fileName, 'r') as f:
>         for line in f:
>             dimensions = []
>             for s in line.rstrip('\n').split(","):
>                 if isNumeric(s):
>                     dimensions.append(float(s))
>                 else:
>                     dimensions.append(float(quantization[s]))
>             fixedPoints.append(Point(dimensions))
>     return Cluster(fixedPoints)

I'd also remove the isNumeric() function. You're just converting to a
float to see if you can successfully convert to a float!

    try:
        dimensions.append(float(s))
    except ValueError:
        dimensions.append(float(quantization[s]))
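[Editor's note: combining Carl Banks' defaultdict approach with MRAB's
`except ValueError` gives a one-pass version that runs as-is. This is a
sketch in modern Python 3 syntax; the thread's `Point` and `Cluster`
classes are never shown, so plain lists stand in for them here.]

```python
import collections
import io
import itertools

def create_initial_cluster(f):
    """One pass over the file: floats stay floats, strings get
    sequentially-assigned codes via defaultdict."""
    quantization = collections.defaultdict(itertools.count().__next__)
    fixed_points = []
    for line in f:
        dimensions = []
        for s in line.rstrip('\n').split(','):
            try:
                dimensions.append(float(s))
            except ValueError:
                dimensions.append(float(quantization[s]))
        fixed_points.append(dimensions)   # the thread wraps each in Point(...)
    return fixed_points                   # ...and the list in Cluster(...)

data = io.StringIO("5.1,3.5,1.4,0.2,Iris-setosa\n"
                   "6.4,3.2,4.5,1.5,Iris-versicolor\n"
                   "5.0,3.4,1.5,0.2,Iris-setosa\n")
for p in create_initial_cluster(data):
    print(p)
# [5.1, 3.5, 1.4, 0.2, 0.0]
# [6.4, 3.2, 4.5, 1.5, 1.0]
# [5.0, 3.4, 1.5, 0.2, 0.0]
```

Note andrew's caveat still applies: a single `quantization` dict is
shared across columns, so codes are distinct but not contiguous per
column.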
Re: in place list modification necessary? What's a better idiom?
Carl Banks wrote:
> import collections
> import itertools
>
> def createInitialCluster(fileName):
>     fixedPoints = []
>     # quantization is a dict that assigns sequentially-increasing
>     # numbers to values when reading keys that don't yet exist
>     quantization = collections.defaultdict(itertools.count().next)
>     with open(fileName, 'r') as f:
>         for line in f:
>             dimensions = []
>             for s in line.rstrip('\n').split(","):
>                 if isNumeric(s):
>                     dimensions.append(float(s))
>                 else:
>                     dimensions.append(float(quantization[s]))
>             fixedPoints.append(Point(dimensions))
>     return Cluster(fixedPoints)

nice reply (i didn't know defaultdict worked like that - very neat).

two small things i noticed:

1 - do you need a separate quantization for each column? the code above
might give, for example, non-contiguous ranges of integers for a
particular column if a string occurs ("by accident" perhaps) in more
than one.

2 - don't bother with isNumeric. just return the cast value or catch the
exception:

[...]
try:
    dimensions.append(float(s))
except:
    dimensions.append(float(quantization[s]))

(not sure float() is needed there either if you're using a recent
version of python - only reason i can think of is to avoid integer
division in older versions).

andrew
Re: in place list modification necessary? What's a better idiom?
Carl Banks wrote:
> On Apr 7, 12:38 am, Peter Otten <__pete...@web.de> wrote:
>> MooMaster wrote:
>>> Now we can't calculate a meaningful Euclidean distance for something
>>> like "Iris-setosa" and "Iris-versicolor" unless we use string-edit
>>> distance or something overly complicated, so instead we'll use a
>>> simple quantization scheme of enumerating the set of values within
>>> the column domain and replacing the strings with numbers (i.e.
>>> Iris-setosa = 1, iris-versicolor=2).
>>
>> I'd calculate the distance as
>>
>> def string_dist(x, y, weight=1):
>>     return weight * (x == y)
>>
>> You don't get a high resolution in that dimension, but you don't
>> introduce an element of randomness, either.
>
> Does the algorithm require well-ordered data along the dimensions?
> Though I've never heard of it, the fact that it's called "bisecting
> Kmeans" suggests to me that it does, which means this wouldn't work.

I've read about K-means in Segaran's "Collective Intelligence", which
describes it:

"K-Means clustering begins with k randomly placed centroids (points in
space that represent the center of the cluster), and assigns every item
to the nearest one. After the assignment, the centroids are moved to the
average location of all the nodes assigned to them, and the assignments
are redone. This process repeats until the assignments stop changing."

The book doesn't go into the theory, and "any distance would do" was my
assumption, which may be wrong.

Peter
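[Editor's note: the description Peter quotes maps directly onto a short
loop. A minimal sketch for points given as equal-length float lists; to
keep it deterministic, the "randomly placed" centroids are simply the
first k points, which is an assumption of this sketch, not of the book.]

```python
def kmeans(points, k, max_iter=100):
    # Seed centroids with the first k points (deterministic stand-in
    # for "k randomly placed centroids").
    centroids = [list(p) for p in points[:k]]
    assignment = [None] * len(points)
    for _ in range(max_iter):
        # Assign every item to the nearest centroid (squared Euclidean).
        new_assignment = [
            min(range(k), key=lambda c: sum((a - b) ** 2
                for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignment == assignment:   # assignments stopped changing
            break
        assignment = new_assignment
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return assignment, centroids

points = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9]]
assignment, centroids = kmeans(points, 2)
print(assignment)   # [0, 0, 1, 1]
```

Nothing here requires the dimensions to be well ordered, only that the
distance function is consistent from iteration to iteration, which is
exactly MooMaster's "invariant space" point earlier in the thread.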
Re: in place list modification necessary? What's a better idiom?
On Apr 7, 12:38 am, Peter Otten <__pete...@web.de> wrote:
> MooMaster wrote:
>> Now we can't calculate a meaningful Euclidean distance for something
>> like "Iris-setosa" and "Iris-versicolor" unless we use string-edit
>> distance or something overly complicated, so instead we'll use a
>> simple quantization scheme of enumerating the set of values within the
>> column domain and replacing the strings with numbers (i.e. Iris-setosa
>> = 1, iris-versicolor=2).
>
> I'd calculate the distance as
>
> def string_dist(x, y, weight=1):
>     return weight * (x == y)
>
> You don't get a high resolution in that dimension, but you don't
> introduce an element of randomness, either.

Does the algorithm require well-ordered data along the dimensions?
Though I've never heard of it, the fact that it's called "bisecting
Kmeans" suggests to me that it does, which means this wouldn't work.

However, the OP better be sure to set the scales for the quantized
dimensions high enough so that no clusters form containing points with
different discrete values. That, in turn, suggests he might as well not
even bother sending the discrete values to the clustering algorithm, but
instead call it for each unique set of discretes. (However, I could
imagine the marginal cost of more dimensions is less than that of
multiple runs; I've been dealing with such a case at work.) I'll leave
it to the OP to decide.

Carl Banks
Re: in place list modification necessary? What's a better idiom?
Peter Otten wrote:
> MooMaster wrote:
>
>> Now we can't calculate a meaningful Euclidean distance for something
>> like "Iris-setosa" and "Iris-versicolor" unless we use string-edit
>> distance or something overly complicated, so instead we'll use a
>> simple quantization scheme of enumerating the set of values within the
>> column domain and replacing the strings with numbers (i.e. Iris-setosa
>> = 1, iris-versicolor=2).
>
> I'd calculate the distance as
>
> def string_dist(x, y, weight=1):
>     return weight * (x == y)

oops, this must of course be (x != y).

> You don't get a high resolution in that dimension, but you don't
> introduce an element of randomness, either.
>
> Peter
Re: in place list modification necessary? What's a better idiom?
MooMaster wrote:
> Now we can't calculate a meaningful Euclidean distance for something
> like "Iris-setosa" and "Iris-versicolor" unless we use string-edit
> distance or something overly complicated, so instead we'll use a
> simple quantization scheme of enumerating the set of values within the
> column domain and replacing the strings with numbers (i.e. Iris-setosa
> = 1, iris-versicolor=2).

I'd calculate the distance as

def string_dist(x, y, weight=1):
    return weight * (x == y)

You don't get a high resolution in that dimension, but you don't
introduce an element of randomness, either.

Peter
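[Editor's note: with Peter's later `!=` correction applied, his
suggestion amounts to a per-dimension distance that mixes numeric and
nominal columns without imposing an artificial ordering. A hedged
sketch; `mixed_dist` and the sample records are illustrations, not
thread code:]

```python
def string_dist(x, y, weight=1):
    # Peter's function with his follow-up fix: the distance is `weight`
    # when the nominal values differ, 0 when they match.
    return weight * (x != y)

def mixed_dist(p, q):
    """Squared distance: Euclidean on floats, string_dist on strings."""
    total = 0.0
    for a, b in zip(p, q):
        if isinstance(a, str):
            total += string_dist(a, b) ** 2
        else:
            total += (a - b) ** 2
    return total

p = [5.1, 3.5, "Iris-setosa"]
q = [6.4, 3.2, "Iris-versicolor"]
r = [5.0, 3.4, "Iris-setosa"]
print(mixed_dist(p, q))  # numeric part plus 1 for the differing species
print(mixed_dist(p, r))  # matching species contributes 0
```

Because matching strings contribute nothing and differing strings
contribute a fixed penalty, Versicolor is no "closer" to Virginica than
to Setosa, which is exactly the nominal-scale problem MooMaster raises
elsewhere in the thread.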
Re: in place list modification necessary? What's a better idiom?
MooMaster wrote:
> So I'm reading in values from a file, and for each column I need to
> dynamically discover the range of possible values it can take and
> quantize if necessary. This is the solution I've come up with:
>
> def createInitialCluster(fileName):
>     #get the data from the file
>     points = []
>     with open(fileName, 'r') as f:
>         for line in f:
>             points.append(line.rstrip('\n'))
>     #clean up the data
>     fixedPoints = []
>     for point in points:
>         dimensions = [quantize(i, points, point.split(",").index(i))
>                       for i in point.split(",")]
>         print dimensions
>         fixedPoints.append(Point(dimensions))
>     #return an initial cluster of all the points
>     return Cluster(fixedPoints)
>
> def quantize(stringToQuantize, pointList, columnIndex):
>     #if it's numeric, no need to quantize
>     if(isNumeric(stringToQuantize)):
>         return float(stringToQuantize)
>     #first we need to discover all the possible values of this column
>     domain = []
>     for point in pointList:
>         domain.append(point.split(",")[columnIndex])
>     #make it a set to remove duplicates
>     domain = list(Set(domain))
>     #use the index into the domain as the number representing this value
>     return float(domain.index(stringToQuantize))
>
> #harvested from http://www.rosettacode.org/wiki/IsNumeric#Python
> def isNumeric(string):
>     try:
>         i = float(string)
>     except ValueError:
>         return False
>     return True

Big problem with this. I'm guessing you never ran it on a really big
file yet. Your quantize function does a lot of unnecessary work: it
rebuilds the list of indices for every single line, every single
non-numeric entry, in fact. (Tech-speak: that is N^2 behavior in both
the number of lines and number of non-numeric columns. Not good.) This
will work ok for a small file, but will take forever on a large file
(perhaps literally).

So, forgetting about column indexing for a moment, we can improve this
vastly simply by generating the list of indices once. Do that in a
separate function, have it return the list, and then pass that list to
the quantize function. So replace the midportion of your
createInitialCluster function with something like this:

    column_indices = []
    for i in xrange(len(points[0].split(","))):  # number of columns
        column_indices.append(quantize_columns(points, i))
    fixedPoints = []
    for point in points:
        # need the index as well as the entry here
        dimensions = [quantize(i, column_indices[i], s)
                      for (i, s) in enumerate(point.split(","))]
        print dimensions
        fixedPoints.append(Point(dimensions))

And the two functions would be something like this:

def quantize_columns(point_list, column_index):
    # this assumes that if the first entry in a column is numeric,
    # the whole column would be
    if isNumeric(point_list[0].split(",")[column_index]):
        return None  # don't quantize this column
    #first we need to discover all the possible values of this column
    domain = []
    for point in point_list:
        domain.append(point.split(",")[column_index])
    #make it a set to remove duplicates
    return list(set(domain))

def quantize(i, domain, s):
    if domain is None:
        return float(s)
    return float(domain.index(s))

This (once debugged :) will run much, much faster on a large dataset.

Now back to your question.

> It works, but it feels a little ugly, and not exactly Pythonic. Using
> two lists I need the original point list to read in the data, then the
> dimensions one to hold the processed point, and a fixedPoint list to
> make objects out of the processed data. If my dataset is in the order
> of millions, this'll nuke the memory. I tried something like:
>
> for point in points:
>     point = Point([quantize(i, points, point.split(",").index(i))
>                    for i in point.split(",")])
>
> but when I print out the points afterward, it doesn't keep the
> changes.

It's because the order of items in a set is undefined. The order of the
items in list(set(["a","b","c","d"])) might be very different from the
order in list(set(["a","b","c","d","e"])). You were passing quantize an
incomplete list of points, so as the points grew, the items in the set
changed, and it messed up the order. In fact, you should never rely on
the order being the same, even if you created the set with the very same
arguments.

What you are trying to do should be done with dictionaries: create a
dict that maps a value to a number.

Now, based on your quantize function, it would seem that the number
associated with the value is arbitrary (doesn't matter what it is, as
long as it's distinct), so it isn't necessary to read the whole csv file
in before assigning numbers; just build the dict as you go. I suggest
collections.defaultdict for this task. It's like a regular dict, but it
creates a new value any time you access a key that doesn't exist, a
perfect solution to your task.
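[Editor's note: Carl's "build the list of indices once" fix can be
condensed into a runnable Python 3 sketch. The helper names follow his
snippet; the `sorted(...)` call and the row pre-splitting are my
additions, the former to sidestep the set-ordering instability he
diagnoses above.]

```python
def is_numeric(s):
    try:
        float(s)
        return True
    except ValueError:
        return False

def quantize_columns(rows, col):
    """One pass per column: None for numeric columns, else the domain."""
    if is_numeric(rows[0][col]):
        return None              # assume numeric first entry => numeric column
    return sorted({row[col] for row in rows})   # sorted => stable order

def quantize(domain, s):
    return float(s) if domain is None else float(domain.index(s))

lines = ["5.1,3.5,1.4,0.2,Iris-setosa",
         "6.4,3.2,4.5,1.5,Iris-versicolor"]
rows = [line.split(",") for line in lines]      # split once, not per cell
domains = [quantize_columns(rows, i) for i in range(len(rows[0]))]
points = [[quantize(domains[i], s) for i, s in enumerate(row)]
          for row in rows]
print(points)
# [[5.1, 3.5, 1.4, 0.2, 0.0], [6.4, 3.2, 4.5, 1.5, 1.0]]
```

Each domain is computed exactly once (O(N) per column rather than per
cell), and because the full dataset is in hand before any code is
assigned, the set-growth bug cannot occur.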
in place list modification necessary? What's a better idiom?
A similar discussion has already occurred, over 4 years ago:
http://groups.google.com/group/comp.lang.python/browse_thread/thread/b806ada0732643d/5dff55826a199928?lnk=gst&q=list+in+place#5dff55826a199928

Nevertheless, I have a use-case where such a discussion comes up. For my
data mining class I'm writing an implementation of the bisecting K-means
clustering algorithm (if you're not familiar with clustering and are
interested, this gives a decent example-based overview:
http://rakaposhi.eas.asu.edu/cse494/notes/f02-clustering.ppt).

Given a CSV dataset of n records, we are to cluster them accordingly.
The dataset is generalizable enough to have any kind of data type
(strings, floats, booleans, etc.) for each of the record's columnar
values; for example, here's a couple of records from the famous iris
dataset:

5.1,3.5,1.4,0.2,Iris-setosa
6.4,3.2,4.5,1.5,Iris-versicolor

Now we can't calculate a meaningful Euclidean distance for something
like "Iris-setosa" and "Iris-versicolor" unless we use string-edit
distance or something overly complicated, so instead we'll use a simple
quantization scheme of enumerating the set of values within the column
domain and replacing the strings with numbers (i.e. Iris-setosa = 1,
iris-versicolor = 2).

So I'm reading in values from a file, and for each column I need to
dynamically discover the range of possible values it can take and
quantize if necessary.
This is the solution I've come up with:

def createInitialCluster(fileName):
    #get the data from the file
    points = []
    with open(fileName, 'r') as f:
        for line in f:
            points.append(line.rstrip('\n'))
    #clean up the data
    fixedPoints = []
    for point in points:
        dimensions = [quantize(i, points, point.split(",").index(i))
                      for i in point.split(",")]
        print dimensions
        fixedPoints.append(Point(dimensions))
    #return an initial cluster of all the points
    return Cluster(fixedPoints)

def quantize(stringToQuantize, pointList, columnIndex):
    #if it's numeric, no need to quantize
    if(isNumeric(stringToQuantize)):
        return float(stringToQuantize)
    #first we need to discover all the possible values of this column
    domain = []
    for point in pointList:
        domain.append(point.split(",")[columnIndex])
    #make it a set to remove duplicates
    domain = list(Set(domain))
    #use the index into the domain as the number representing this value
    return float(domain.index(stringToQuantize))

#harvested from http://www.rosettacode.org/wiki/IsNumeric#Python
def isNumeric(string):
    try:
        i = float(string)
    except ValueError:
        return False
    return True

It works, but it feels a little ugly, and not exactly Pythonic. Using
two lists I need the original point list to read in the data, then the
dimensions one to hold the processed point, and a fixedPoint list to
make objects out of the processed data. If my dataset is in the order of
millions, this'll nuke the memory. I tried something like:

for point in points:
    point = Point([quantize(i, points, point.split(",").index(i))
                   for i in point.split(",")])

but when I print out the points afterward, it doesn't keep the changes.
What's a more efficient way of doing this?
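[Editor's note: the "doesn't keep the changes" part of the question has a
simple cause that is independent of the set-ordering issue discussed in
the replies: rebinding the loop variable never touches the list, while
assigning through the index does. A minimal illustration, not thread
code:]

```python
points = ["1,2", "3,4"]

# Rebinding the loop variable leaves the list untouched...
for point in points:
    point = point.split(",")
print(points)       # ['1,2', '3,4'] -- unchanged

# ...but assigning back through the index modifies the list in place.
for i, point in enumerate(points):
    points[i] = point.split(",")
print(points)       # [['1', '2'], ['3', '4']]
```

`point = ...` only makes the local name `point` refer to a new object;
`points[i] = ...` actually stores the new object in the list, which is
the in-place idiom the subject line asks about.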