Re: in place list modification necessary? What's a better idiom?

2009-04-08 Thread MooMaster
On Apr 7, 12:40 am, Peter Otten <__pete...@web.de> wrote:
> Peter Otten wrote:
> > MooMaster wrote:
>
> >> Now we can't calculate a meaningful Euclidean distance for something
> >> like "Iris-setosa" and "Iris-versicolor" unless we use string-edit
> >> distance or something overly complicated, so instead we'll use a
> >> simple quantization scheme of enumerating the set of values within the
> >> column domain and replacing the strings with numbers (i.e. Iris-setosa
> >> = 1, iris-versicolor=2).
>
> > I'd calculate the distance as
>
> > def string_dist(x, y, weight=1):
> >     return weight * (x == y)
>
> oops, this must of course be (x != y).
>
> > You don't get a high resolution in that dimension, but you don't introduce
> > an element of randomness, either.
>
> > Peter
>
>

The randomness doesn't matter too much: all K-means cares about is the
distance between two points in a coordinate space, and as long as that
space is invariant it doesn't matter much (i.e. we don't want
(1,1) becoming (3,1) on the next iteration, or the value for a
quantized column changing). With that in mind, I was hoping to be lazy
and just go with an enumeration approach...

Nevertheless, it does introduce a subtle ordering for nominal data: if
Iris-Setosa = 1, Iris-Versicolor = 2, and Iris-Virginica = 3, then on
that scale Iris-Versicolor is intuitively "closer" to Virginica than
Setosa is, when in fact such distances don't mean anything on a
nominal scale. I hadn't thought about a function like that, but it
makes a lot of sense. Thanks!
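For what it's worth, here is a small sketch of how Peter's weighted discrete metric could slot into a mixed numeric/nominal distance function. (`mixed_dist` is my name, not from the thread, and this uses his corrected `(x != y)` version.)

```python
def string_dist(x, y, weight=1.0):
    # discrete metric: 0 when equal, `weight` when different
    return weight * (x != y)

def mixed_dist(p, q, weight=1.0):
    # Euclidean distance over tuples that mix floats and nominal strings
    total = 0.0
    for a, b in zip(p, q):
        d = string_dist(a, b, weight) if isinstance(a, str) else a - b
        total += d * d
    return total ** 0.5

mixed_dist((5.1, 3.5, "Iris-setosa"), (5.1, 3.5, "Iris-versicolor"))  # -> 1.0
```

The nominal column contributes either 0 or `weight` to the squared sum, so no spurious ordering among the string values ever enters the distance.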
--
http://mail.python.org/mailman/listinfo/python-list


Re: in place list modification necessary? What's a better idiom?

2009-04-08 Thread MooMaster
On Apr 6, 10:43 pm, Carl Banks  wrote:
> MooMaster wrote:
> > So I'm reading in values from a file, and for each column I need to
> > dynamically discover the range of possible values it can take and
> > quantize if necessary. This is the solution I've come up with:
>
> > 
> > def createInitialCluster(fileName):
> >     #get the data from the file
> >     points = []
> >     with open(fileName, 'r') as f:
> >         for line in f:
> >             points.append(line.rstrip('\n'))
> >     #clean up the data
> >     fixedPoints = []
> >     for point in points:
> >     dimensions = [quantize(i, points, point.split(",").index(i))
> >                   for i in point.split(",")]
> >         print dimensions
> >         fixedPoints.append(Point(dimensions))
> >     #return an initial cluster of all the points
> >     return Cluster(fixedPoints)
>
> > def quantize(stringToQuantize, pointList, columnIndex):
> >     #if it's numeric, no need to quantize
> >     if(isNumeric(stringToQuantize)):
> >         return float(stringToQuantize)
> >     #first we need to discover all the possible values of this column
> >     domain = []
> >     for point in pointList:
> >         domain.append(point.split(",")[columnIndex])
> >     #make it a set to remove duplicates
> >     domain = list(set(domain))
> >     #use the index into the domain as the number representing this value
> >     return float(domain.index(stringToQuantize))
>
> > #harvested from http://www.rosettacode.org/wiki/IsNumeric#Python
> > def isNumeric(string):
> >     try:
> >         i = float(string)
> >     except ValueError:
> >         return False
> >     return True
>
> Big problem with this.  I'm guessing you never ran it on a really big
> file yet.  Your quantize function does a lot of unnecessary work: it
> rebuilds the list of indices for every single line, every single non-
> numeric entry, in fact.  (Tech-speak: that is N^2 behavior in both the
> number of lines and number of non-numeric columns.  Not good.)  This
> will work ok for a small file, but will take forever on a large file
> (perhaps literally).
>
> So, forgetting about column indexing for a moment, we can improve this
> vastly simply by generating the list of indices once.  Do that in a
> separate function, have it return the list, and then pass that list to
> the quantize function.  So replace the midportion of your
> createInitialCluster function with something like this:
>
>     
>     column_indices = []
>     for i in xrange(len(points[0].split(","))): # number of columns
>         column_indices.append(quantize_columns(points, i))
>     fixedPoints = []
>     for point in points:
>         # need the column index as well as the entry here
>         dimensions = [quantize(i, column_indices[i], s)
>                       for (i, s) in enumerate(point.split(","))]
>         print dimensions
>         fixedPoints.append(Point(dimensions))
>     
>
> And the two functions would be something like this:
>
> def quantize_columns(point_list, column_index):
>     # this assumes that if the column's first entry is numeric,
>     # the whole column is
>     if(isNumeric(point_list[0].split(",")[column_index])):
>         return None # don't quantize this column
>     #first we need to discover all the possible values of this column
>     domain = []
>     for point in point_list:
>         domain.append(point.split(",")[column_index])
>     #make it a set to remove duplicates
>     return list(set(domain))
>
> def quantize(i, domain, s):
>     if domain is None:
>         return float(s)
>     return float(domain.index(s))
>
> This (once debugged :) will run much, much faster on a large dataset.
>
> Now back to your question.
>
> > It works, but it feels a little ugly, and not exactly Pythonic. Using
> > two lists I need the original point list to read in the data, then the
> > dimensions one to hold the processed point, and a fixedPoint list to
> > make objects out of the processed data. If my dataset is in the order
> > of millions, this'll nuke the memory. I tried something like:
>
> > for point in points:
> >     point = Point([quantize(i, points, point.split(",").index(i))
> >                    for i in point.split(",")])
> > but when I print out the points afterward, it doesn't keep the
> > changes.
>
> It's because the order of items in a set is undefined.  The order of
> the items in list(set(["a","b","c","d"])) might be very different from
> the order in list(set(["a","b","c","d","e"])).  You were passing
> quantize an incomplete list of points, so as the points
> grew, the items in the set changed, and it messed up the order.  In
> fact,  you should never rely on the order being the same, even if you
> created the set with the very same arguments.
>
> What you are trying to do should be done with dictionaries: create a
> dict that maps a value to a number.
>
> Now, based on your quantize function, it would seem that the number
> associated with the value is arbitrary (doesn't matter what it is, as
> long as it's distinct), so it isn't necessary to read the whole csv
> file in before assigning numbers; just build the dict as you go.

Re: in place list modification necessary? What's a better idiom?

2009-04-07 Thread andrew cooke
MRAB wrote:
> andrew cooke wrote:
>> R. David Murray wrote:
   [...]
   try:
       dimensions.append(float(s))
   except:
       dimensions.append(float(quantization[s]))
>>> No, no, no; never use a bare except! :)
>>
>> can you explain why?  i can't think of any reason why the code would be
>> better catching a specific exception.
>>
>> as a general rule, maybe, but in this particular case i can't see a
>> reason
>> - so i'm not sure if you're just pedantically following rules or if i've
>> missed something i should know.
>>
> A bare exception will catch _any_ exception, even those you didn't
> expect (technical term: "bug" :-)); for example, if you had written:
>
>  try:
>      dimension.append(float(s))
>  except:
>      dimensions.append(float(quantization[s]))
>
> it would silently catch "NameError: name 'dimension' is not defined" and
> perform the fallback action.

true, i hadn't considered coding errors (i was thinking that something
global, like out of memory, would fail again anyway).  andrew



Re: in place list modification necessary? What's a better idiom?

2009-04-07 Thread R. David Murray

On Tue, 7 Apr 2009 at 09:01, andrew cooke wrote:

R. David Murray wrote:

  [...]
  try:
      dimensions.append(float(s))
  except:
      dimensions.append(float(quantization[s]))


No, no, no; never use a bare except! :)


can you explain why?  i can't think of any reason why the code would be
better catching a specific exception.

as a general rule, maybe, but in this particular case i can't see a reason
- so i'm not sure if you're just pedantically following rules or if i've
missed something i should know.


What if the user pressed ctl-c right when the float was being converted
and appended?

Never use a bare except unless you have a specific reason to do so,
and there are very few of those.  (Yes, I should have said it that way
to start with, my apologies for going hyperbolic.)  Using one because it
doesn't _look_ like it will cause issues is just asking for hard-to-
track-down bugs :).

--
R. David Murray http://www.bitdance.com


Re: in place list modification necessary? What's a better idiom?

2009-04-07 Thread MRAB

andrew cooke wrote:

R. David Murray wrote:

  [...]
  try:
      dimensions.append(float(s))
  except:
      dimensions.append(float(quantization[s]))

No, no, no; never use a bare except! :)


can you explain why?  i can't think of any reason why the code would be
better catching a specific exception.

as a general rule, maybe, but in this particular case i can't see a reason
- so i'm not sure if you're just pedantically following rules or if i've
missed something i should know.


A bare exception will catch _any_ exception, even those you didn't
expect (technical term: "bug" :-)); for example, if you had written:

try:
    dimension.append(float(s))
except:
    dimensions.append(float(quantization[s]))

it would silently catch "NameError: name 'dimension' is not defined" and
perform the fallback action.


Re: in place list modification necessary? What's a better idiom?

2009-04-07 Thread andrew cooke
R. David Murray wrote:
>>   [...]
>>   try:
>>       dimensions.append(float(s))
>>   except:
>>       dimensions.append(float(quantization[s]))
>
> No, no, no; never use a bare except! :)

can you explain why?  i can't think of any reason why the code would be
better catching a specific exception.

as a general rule, maybe, but in this particular case i can't see a reason
- so i'm not sure if you're just pedantically following rules or if i've
missed something i should know.

andrew



Re: in place list modification necessary? What's a better idiom?

2009-04-07 Thread andrew cooke

just playing around - doesn't work with 3.0 due to lack of pattern binding
(which i think is coming back in 3.1?)

>>> from collections import defaultdict
>>> from itertools import count
>>> def mkdict(): return defaultdict(count().next)
...
>>> converters = defaultdict(mkdict)
>>> def to_float((i, s)):
...   try: return float(s)
...   except: return converters[i][s]
...
>>> def convert(line):
...   return map(to_float, zip(count(), line.split(',')))
...
>>> convert('1,2,red')
[1.0, 2.0, 0]
>>> convert('1,2,red')
[1.0, 2.0, 0]
>>> convert('1,2,blue')
[1.0, 2.0, 1]
>>> convert('1,2,blue,blue')
[1.0, 2.0, 1, 0]
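(For later readers: the missing feature is tuple parameter unpacking, removed by PEP 3113, and it did not return in 3.1. A rough Python 3 port of the session above, using `except ValueError` per the advice elsewhere in the thread, might look like this:)

```python
from collections import defaultdict
from itertools import count

def mkdict():
    # a fresh counter per column: unseen strings get 0, 1, 2, ...
    return defaultdict(count().__next__)

converters = defaultdict(mkdict)

def to_float(i, s):
    try:
        return float(s)
    except ValueError:
        return converters[i][s]

def convert(line):
    return [to_float(i, s) for i, s in enumerate(line.split(','))]

convert('1,2,red')        # -> [1.0, 2.0, 0]
convert('1,2,blue')       # -> [1.0, 2.0, 1]
convert('1,2,blue,blue')  # -> [1.0, 2.0, 1, 0]
```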


andrew cooke wrote:
> Carl Banks wrote:
>> import collections
>> import itertools
>>
>> def createInitialCluster(fileName):
>>     fixedPoints = []
>>     # quantization is a dict that assigns sequentially-increasing numbers
>>     # to values when reading keys that don't yet exist
>>     quantization = collections.defaultdict(itertools.count().next)
>>     with open(fileName, 'r') as f:
>>         for line in f:
>>             dimensions = []
>>             for s in line.rstrip('\n').split(","):
>>                 if isNumeric(s):
>>                     dimensions.append(float(s))
>>                 else:
>>                     dimensions.append(float(quantization[s]))
>>             fixedPoints.append(Point(dimensions))
>>     return Cluster(fixedPoints)
>
> nice reply (i didn't know defaultdict worked like that - very neat).
>
> two small things i noticed:
>
> 1 - do you need a separate quantization for each column?  the code above
> might give, for example, non-contiguous ranges of integers for a
> particular column if a string occurs ("by accident" perhaps) in more than
> one.
>
> 2 - don't bother with isNumeric.  just return the cast value or catch the
> exception:
>
>   [...]
>   try:
>       dimensions.append(float(s))
>   except:
>       dimensions.append(float(quantization[s]))
>
> (not sure float() is needed there either if you're using a recent version
> of python - only reason i can think of is to avoid integer division in
> older versions).
>
> andrew
>
>
>
>




Re: in place list modification necessary? What's a better idiom?

2009-04-07 Thread R. David Murray
"andrew cooke"  wrote:
> Carl Banks wrote:
> > import collections
> > import itertools
> >
> > def createInitialCluster(fileName):
> >     fixedPoints = []
> >     # quantization is a dict that assigns sequentially-increasing numbers
> >     # to values when reading keys that don't yet exist
> >     quantization = collections.defaultdict(itertools.count().next)
> >     with open(fileName, 'r') as f:
> >         for line in f:
> >             dimensions = []
> >             for s in line.rstrip('\n').split(","):
> >                 if isNumeric(s):
> >                     dimensions.append(float(s))
> >                 else:
> >                     dimensions.append(float(quantization[s]))
> >             fixedPoints.append(Point(dimensions))
> >     return Cluster(fixedPoints)
> 
> nice reply (i didn't know defaultdict worked like that - very neat).
> 
> two small things i noticed:
> 
> 1 - do you need a separate quantization for each column?  the code above
> might give, for example, non-contiguous ranges of integers for a
> particular column if a string occurs ("by accident" perhaps) in more than
> one.
> 
> 2 - don't bother with isNumeric.  just return the cast value or catch the
> exception:
> 
>   [...]
>   try:
>       dimensions.append(float(s))
>   except:
>       dimensions.append(float(quantization[s]))

No, no, no; never use a bare except! :)

Do it MRAB's way and catch ValueError.

--
R. David Murray http://www.bitdance.com



Re: in place list modification necessary? What's a better idiom?

2009-04-07 Thread MRAB

Carl Banks wrote:

MooMaster wrote:

So I'm reading in values from a file, and for each column I need to
dynamically discover the range of possible values it can take and
quantize if necessary. This is the solution I've come up with:


[snip]

#harvested from http://www.rosettacode.org/wiki/IsNumeric#Python
def isNumeric(string):
    try:
        i = float(string)
    except ValueError:
        return False
    return True



[snip]


import collections
import itertools

def createInitialCluster(fileName):
    fixedPoints = []
    # quantization is a dict that assigns sequentially-increasing numbers
    # to values when reading keys that don't yet exist
    quantization = collections.defaultdict(itertools.count().next)
    with open(fileName, 'r') as f:
        for line in f:
            dimensions = []
            for s in line.rstrip('\n').split(","):
                if isNumeric(s):
                    dimensions.append(float(s))
                else:
                    dimensions.append(float(quantization[s]))
            fixedPoints.append(Point(dimensions))
    return Cluster(fixedPoints)


I'd also remove the isNumeric() function. You're just converting to a
float if you can successfully convert to a float!

try:
    dimensions.append(float(s))
except ValueError:
    dimensions.append(float(quantization[s]))


Re: in place list modification necessary? What's a better idiom?

2009-04-07 Thread andrew cooke
Carl Banks wrote:
> import collections
> import itertools
>
> def createInitialCluster(fileName):
>     fixedPoints = []
>     # quantization is a dict that assigns sequentially-increasing numbers
>     # to values when reading keys that don't yet exist
>     quantization = collections.defaultdict(itertools.count().next)
>     with open(fileName, 'r') as f:
>         for line in f:
>             dimensions = []
>             for s in line.rstrip('\n').split(","):
>                 if isNumeric(s):
>                     dimensions.append(float(s))
>                 else:
>                     dimensions.append(float(quantization[s]))
>             fixedPoints.append(Point(dimensions))
>     return Cluster(fixedPoints)

nice reply (i didn't know defaultdict worked like that - very neat).

two small things i noticed:

1 - do you need a separate quantization for each column?  the code above
might give, for example, non-contiguous ranges of integers for a
particular column if a string occurs ("by accident" perhaps) in more than
one.

2 - don't bother with isNumeric.  just return the cast value or catch the
exception:

  [...]
  try:
      dimensions.append(float(s))
  except:
      dimensions.append(float(quantization[s]))

(not sure float() is needed there either if you're using a recent version
of python - only reason i can think of is to avoid integer division in
older versions).

andrew




Re: in place list modification necessary? What's a better idiom?

2009-04-07 Thread Peter Otten
Carl Banks wrote:

> On Apr 7, 12:38 am, Peter Otten <__pete...@web.de> wrote:
>> MooMaster wrote:
>> > Now we can't calculate a meaningful Euclidean distance for something
>> > like "Iris-setosa" and "Iris-versicolor" unless we use string-edit
>> > distance or something overly complicated, so instead we'll use a
>> > simple quantization scheme of enumerating the set of values within the
>> > column domain and replacing the strings with numbers (i.e. Iris-setosa
>> > = 1, iris-versicolor=2).
>>
>> I'd calculate the distance as
>>
>> def string_dist(x, y, weight=1):
>> return weight * (x == y)
>>
>> You don't get a high resolution in that dimension, but you don't
>> introduce an element of randomness, either.
> 
> Does the algorithm require well-ordered data along the dimensions?
> Though I've never heard of it, the fact that it's called "bisecting
> Kmeans" suggests to me that it does, which means this wouldn't work.

I've read about K-Means in Segaran's "Collective Intelligence" which
describes it:

"K-Means clustering begins with k randomly placed centroids (points in space
that represent the center of the cluster), and assigns every item to the
nearest one. After the assignment, the centroids are moved to the average
location of all the nodes assigned to them, and the assignments are redone.
This process repeats until the assignments stop changing."
 
The book doesn't go into the theory, and "any distance would do" was my
assumption, which may be wrong.
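For concreteness, the loop Segaran describes can be sketched like this (plain k-means with squared Euclidean distance, not the bisecting variant the OP mentions; this is just my reading of the description above):

```python
import random

def kmeans(points, k, iterations=100):
    # k randomly placed centroids (here: sampled from the data itself)
    centroids = random.sample(points, k)
    for _ in range(iterations):
        # assign every item to the nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda j: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[j])))
            clusters[nearest].append(p)
        # move each centroid to the average location of its items
        moved = [[sum(col) / len(c) for col in zip(*c)] if c else centroids[j]
                 for j, c in enumerate(clusters)]
        if moved == centroids:  # assignments stopped changing
            break
        centroids = moved
    return centroids, clusters
```

Note that the only thing the loop asks of the data is a pairwise distance, which is what makes the "any distance would do" question interesting.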

Peter


Re: in place list modification necessary? What's a better idiom?

2009-04-07 Thread Carl Banks
On Apr 7, 12:38 am, Peter Otten <__pete...@web.de> wrote:
> MooMaster wrote:
> > Now we can't calculate a meaningful Euclidean distance for something
> > like "Iris-setosa" and "Iris-versicolor" unless we use string-edit
> > distance or something overly complicated, so instead we'll use a
> > simple quantization scheme of enumerating the set of values within the
> > column domain and replacing the strings with numbers (i.e. Iris-setosa
> > = 1, iris-versicolor=2).
>
> I'd calculate the distance as
>
> def string_dist(x, y, weight=1):
>     return weight * (x == y)
>
> You don't get a high resolution in that dimension, but you don't introduce
> an element of randomness, either.

Does the algorithm require well-ordered data along the dimensions?
Though I've never heard of it, the fact that it's called "bisecting
Kmeans" suggests to me that it does, which means this wouldn't work.

However, the OP better be sure to set the scales for the quantized
dimensions high enough so that no clusters form containing points with
different discrete values.

That, in turn, suggests he might as well not even bother sending the
discrete values to the clustering algorithm, but instead to call it
for each unique set of discretes.  (However, I could imagine the
marginal cost of more dimensions is less than that of multiple runs;
I've been dealing with such a case at work.)

I'll leave it to the OP to decide.
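Carl's scaling point can be illustrated with a toy check (the helper names here are mine, not from the thread): if the quantized dimension is weighted far above the numeric spread, every cross-species pair is farther apart than any same-species pair, so no cluster can mix discrete values.

```python
def scaled_point(numeric, code, weight=1000.0):
    # append the quantized value, scaled so it dominates the numeric columns
    return numeric + [code * weight]

def dist2(p, q):
    # squared Euclidean distance
    return sum((a - b) ** 2 for a, b in zip(p, q))

same = dist2(scaled_point([5.1, 3.5], 0), scaled_point([4.9, 3.0], 0))
cross = dist2(scaled_point([5.1, 3.5], 0), scaled_point([5.0, 3.4], 1))
# `cross` is dominated by the weight term, so it dwarfs `same`
```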


Carl Banks


Re: in place list modification necessary? What's a better idiom?

2009-04-07 Thread Peter Otten
Peter Otten wrote:

> MooMaster wrote:
> 
>> Now we can't calculate a meaningful Euclidean distance for something
>> like "Iris-setosa" and "Iris-versicolor" unless we use string-edit
>> distance or something overly complicated, so instead we'll use a
>> simple quantization scheme of enumerating the set of values within the
>> column domain and replacing the strings with numbers (i.e. Iris-setosa
>> = 1, iris-versicolor=2).
> 
> I'd calculate the distance as
> 
> def string_dist(x, y, weight=1):
>     return weight * (x == y)

oops, this must of course be (x != y).
 
> You don't get a high resolution in that dimension, but you don't introduce
> an element of randomness, either.
> 
> Peter
> 



Re: in place list modification necessary? What's a better idiom?

2009-04-07 Thread Peter Otten
MooMaster wrote:

> Now we can't calculate a meaningful Euclidean distance for something
> like "Iris-setosa" and "Iris-versicolor" unless we use string-edit
> distance or something overly complicated, so instead we'll use a
> simple quantization scheme of enumerating the set of values within the
> column domain and replacing the strings with numbers (i.e. Iris-setosa
> = 1, iris-versicolor=2).

I'd calculate the distance as

def string_dist(x, y, weight=1):
    return weight * (x == y)

You don't get a high resolution in that dimension, but you don't introduce
an element of randomness, either.

Peter



Re: in place list modification necessary? What's a better idiom?

2009-04-06 Thread Carl Banks
MooMaster wrote:
> So I'm reading in values from a file, and for each column I need to
> dynamically discover the range of possible values it can take and
> quantize if necessary. This is the solution I've come up with:
>
> 
> def createInitialCluster(fileName):
>     #get the data from the file
>     points = []
>     with open(fileName, 'r') as f:
>         for line in f:
>             points.append(line.rstrip('\n'))
>     #clean up the data
>     fixedPoints = []
>     for point in points:
>         dimensions = [quantize(i, points, point.split(",").index(i))
>                       for i in point.split(",")]
>         print dimensions
>         fixedPoints.append(Point(dimensions))
>     #return an initial cluster of all the points
>     return Cluster(fixedPoints)
>
> def quantize(stringToQuantize, pointList, columnIndex):
>     #if it's numeric, no need to quantize
>     if(isNumeric(stringToQuantize)):
>         return float(stringToQuantize)
>     #first we need to discover all the possible values of this column
>     domain = []
>     for point in pointList:
>         domain.append(point.split(",")[columnIndex])
>     #make it a set to remove duplicates
>     domain = list(set(domain))
>     #use the index into the domain as the number representing this value
>     return float(domain.index(stringToQuantize))
>
> #harvested from http://www.rosettacode.org/wiki/IsNumeric#Python
> def isNumeric(string):
>     try:
>         i = float(string)
>     except ValueError:
>         return False
>     return True

Big problem with this.  I'm guessing you never ran it on a really big
file yet.  Your quantize function does a lot of unnecessary work: it
rebuilds the list of indices for every single line, every single non-
numeric entry, in fact.  (Tech-speak: that is N^2 behavior in both the
number of lines and number of non-numeric columns.  Not good.)  This
will work ok for a small file, but will take forever on a large file
(perhaps literally).

So, forgetting about column indexing for a moment, we can improve this
vastly simply by generating the list of indices once.  Do that in a
separate function, have it return the list, and then pass that list to
the quantize function.  So replace the midportion of your
createInitialCluster function with something like this:


column_indices = []
for i in xrange(len(points[0].split(","))): # number of columns
    column_indices.append(quantize_columns(points, i))
fixedPoints = []
for point in points:
    # need the column index as well as the entry here
    dimensions = [quantize(i, column_indices[i], s)
                  for (i, s) in enumerate(point.split(","))]
    print dimensions
    fixedPoints.append(Point(dimensions))


And the two functions would be something like this:

def quantize_columns(point_list, column_index):
    # this assumes that if the column's first entry is numeric,
    # the whole column is
    if(isNumeric(point_list[0].split(",")[column_index])):
        return None # don't quantize this column
    #first we need to discover all the possible values of this column
    domain = []
    for point in point_list:
        domain.append(point.split(",")[column_index])
    #make it a set to remove duplicates
    return list(set(domain))

def quantize(i, domain, s):
    if domain is None:
        return float(s)
    return float(domain.index(s))

This (once debugged :) will run much, much faster on a large dataset.


Now back to your question.

> It works, but it feels a little ugly, and not exactly Pythonic. Using
> two lists I need the original point list to read in the data, then the
> dimensions one to hold the processed point, and a fixedPoint list to
> make objects out of the processed data. If my dataset is in the order
> of millions, this'll nuke the memory. I tried something like:
>
> for point in points:
>     point = Point([quantize(i, points, point.split(",").index(i))
>                    for i in point.split(",")])

> but when I print out the points afterward, it doesn't keep the
> changes.

It's because the order of items in a set is undefined.  The order of
the items in list(set(["a","b","c","d"])) might be very different from
the order in list(set(["a","b","c","d","e"])).  You were passing
quantize an incomplete list of points, so as the points
grew, the items in the set changed, and it messed up the order.  In
fact,  you should never rely on the order being the same, even if you
created the set with the very same arguments.

What you are trying to do should be done with dictionaries: create a
dict that maps a value to a number.

Now, based on your quantize function, it would seem that the number
associated with the value is arbitrary (doesn't matter what it is, as
long as it's distinct), so it isn't necessary to read the whole csv
file in before assigning numbers; just build the dict as you go.

I suggest collections.defaultdict for this task.  It's like a regular
dict, but it creates a new value any time you access a key that
doesn't exist, a perfect solution to your task. 

in place list modification necessary? What's a better idiom?

2009-04-06 Thread MooMaster
A similar discussion has already occurred, over 4 years ago:
http://groups.google.com/group/comp.lang.python/browse_thread/thread/b806ada0732643d/5dff55826a199928?lnk=gst&q=list+in+place#5dff55826a199928

Nevertheless, I have a use-case where such a discussion comes up. For
my data mining class I'm writing an implementation of the bisecting
KMeans clustering algorithm (if you're not familiar with clustering
and are interested, this gives a decent example based overview:
http://rakaposhi.eas.asu.edu/cse494/notes/f02-clustering.ppt). Given a
CSV dataset of n records, we are to cluster them accordingly.

The dataset is generalizable enough to have any kind of data-type
(strings, floats, booleans, etc) for each of the record's columnar
values, for example here's a couple of  records from the famous iris
dataset:

5.1,3.5,1.4,0.2,Iris-setosa
6.4,3.2,4.5,1.5,Iris-versicolor

Now we can't calculate a meaningful Euclidean distance for something
like "Iris-setosa" and "Iris-versicolor" unless we use string-edit
distance or something overly complicated, so instead we'll use a
simple quantization scheme of enumerating the set of values within the
column domain and replacing the strings with numbers (i.e. Iris-setosa
= 1, iris-versicolor=2).

So I'm reading in values from a file, and for each column I need to
dynamically discover the range of possible values it can take and
quantize if necessary. This is the solution I've come up with:


def createInitialCluster(fileName):
    #get the data from the file
    points = []
    with open(fileName, 'r') as f:
        for line in f:
            points.append(line.rstrip('\n'))
    #clean up the data
    fixedPoints = []
    for point in points:
        dimensions = [quantize(i, points, point.split(",").index(i))
                      for i in point.split(",")]
        print dimensions
        fixedPoints.append(Point(dimensions))
    #return an initial cluster of all the points
    return Cluster(fixedPoints)

def quantize(stringToQuantize, pointList, columnIndex):
    #if it's numeric, no need to quantize
    if(isNumeric(stringToQuantize)):
        return float(stringToQuantize)
    #first we need to discover all the possible values of this column
    domain = []
    for point in pointList:
        domain.append(point.split(",")[columnIndex])
    #make it a set to remove duplicates
    domain = list(set(domain))
    #use the index into the domain as the number representing this value
    return float(domain.index(stringToQuantize))

#harvested from http://www.rosettacode.org/wiki/IsNumeric#Python
def isNumeric(string):
    try:
        i = float(string)
    except ValueError:
        return False
    return True



It works, but it feels a little ugly, and not exactly Pythonic. I need
the original point list to read in the data, a dimensions list to hold
each processed point, and a fixedPoints list to make objects out of the
processed data. If my dataset is in the order of millions, this'll nuke
the memory. I tried something like:

for point in points:
    point = Point([quantize(i, points, point.split(",").index(i))
                   for i in point.split(",")])

but when I print out the points afterward, it doesn't keep the
changes.

What's a more efficient way of doing this?
