Re: in place list modification necessary? What's a better idiom?

2009-04-08 Thread MooMaster
On Apr 6, 10:43 pm, Carl Banks pavlovevide...@gmail.com wrote:
 MooMaster wrote:
  So I'm reading in values from a file, and for each column I need to
  dynamically discover the range of possible values it can take and
  quantize if necessary. This is the solution I've come up with:

  code
  def createInitialCluster(fileName):
      #get the data from the file
      points = []
      with open(fileName, 'r') as f:
          for line in f:
              points.append(line.rstrip('\n'))
      #clean up the data
      fixedPoints = []
      for point in points:
          dimensions = [quantize(i, points, point.split(',').index(i))
                        for i in point.split(',')]
          print dimensions
          fixedPoints.append(Point(dimensions))
      #return an initial cluster of all the points
      return Cluster(fixedPoints)

  def quantize(stringToQuantize, pointList, columnIndex):
      #if it's numeric, no need to quantize
      if(isNumeric(stringToQuantize)):
          return float(stringToQuantize)
      #first we need to discover all the possible values of this column
      domain = []
      for point in pointList:
          domain.append(point.split(',')[columnIndex])
      #make it a set to remove duplicates
      domain = list(Set(domain))
      #use the index into the domain as the number representing this value
      return float(domain.index(stringToQuantize))

  #harvested from http://www.rosettacode.org/wiki/IsNumeric#Python
  def isNumeric(string):
      try:
          i = float(string)
      except ValueError:
          return False
      return True

 Big problem with this.  I'm guessing you never ran it on a really big
 file yet.  Your quantize function does a lot of unnecessary work: it
 rebuilds the list of indices for every single line, every single non-
 numeric entry, in fact.  (Tech-speak: that is N^2 behavior in both the
 number of lines and number of non-numeric columns.  Not good.)  This
 will work ok for a small file, but will take forever on a large file
 (perhaps literally).

 So, forgetting about column indexing for a moment, we can improve this
 vastly simply by generating the list of indices once.  Do that in a
 separate function, have it return the list, and then pass that list to
 the quantize function.  So replace the midportion of your
 createInitialCluster function with something like this:

     
     column_indices = []
     for i in xrange(len(points[0].split(','))): # number of columns
         column_indices.append(quantize_columns(points, i))
     fixedPoints = []
     for point in points:
         dimensions = [quantize(i, column_indices[i], s)
                       for (i, s) in enumerate(point.split(','))] # need index as well as entry here
         print dimensions
         fixedPoints.append(Point(dimensions))
     

 And the two functions would be something like this:

 def quantize_columns(point_list, column_index):
     # this assumes if the first entry in a column is numeric, the whole
     # column would be
     if isNumeric(point_list[0].split(',')[column_index]):
         return None # don't quantize this column
     #first we need to discover all the possible values of this column
     domain = []
     for point in point_list:
         domain.append(point.split(',')[column_index])
     #make it a set to remove duplicates
     return list(set(domain))

 def quantize(i,domain,s):
     if domain is None:
         return float(s)
     return float(domain.index(s))

 This (once debugged :) will run much, much faster on a large dataset.

 Now back to your question.

  It works, but it feels a little ugly, and not exactly Pythonic. Using
  two lists I need the original point list to read in the data, then the
  dimensions one to hold the processed point, and a fixedPoint list to
  make objects out of the processed data. If my dataset is in the order
  of millions, this'll nuke the memory. I tried something like:

  for point in points:
     point = Point([quantize(i, points, point.split(',').index(i))
                    for i in point.split(',')])
  but when I print out the points afterward, it doesn't keep the
  changes.

 It's because the order of items in a set is undefined.  The order of
 the items in list(set(["a","b","c","d"])) might be very different from
 the order in list(set(["a","b","c","d","e"])).  You were passing
 quantize an incomplete list of points, so as the point list
 grew, the items in the set changed, and it messed up the order.  In
 fact, you should never rely on the order being the same, even if you
 created the set with the very same arguments.

 What you are trying to do should be done with dictionaries: create a
 dict that maps a value to a number.

 Now, based on your quantize function, it would seem that the number
 associated with the value is arbitrary (doesn't matter what it is, as
 long as it's distinct), so it isn't necessary to read the whole csv
 file in before assigning numbers; just build the dict as you go.

 I suggest collections.defaultdict for this task.  It's like a regular
 dict, but it creates a new value any time you access a key that
 doesn't exist, a perfect solution to your task.  We'll pass it a
 function that generates a different index each time it's called.

Re: in place list modification necessary? What's a better idiom?

2009-04-08 Thread MooMaster
On Apr 7, 12:40 am, Peter Otten __pete...@web.de wrote:
 Peter Otten wrote:
  MooMaster wrote:

  Now we can't calculate a meaningful Euclidean distance for something
  like Iris-setosa and Iris-versicolor unless we use string-edit
  distance or something overly complicated, so instead we'll use a
  simple quantization scheme of enumerating the set of values within the
  column domain and replacing the strings with numbers (i.e. Iris-setosa
  = 1, iris-versicolor=2).

  I'd calculate the distance as

  def string_dist(x, y, weight=1):
      return weight * (x == y)

 oops, this must of course be (x != y).

  You don't get a high resolution in that dimension, but you don't introduce
  an element of randomness, either.

  Peter



The randomness doesn't matter too much; all K-means cares about is the
distance between two points in a coordinate space, and as long as that
space is invariant it doesn't matter too much (i.e. we don't want
(1,1) becoming (3,1) on the next iteration, or the value for a
quantized column changing). With that in mind, I was hoping to be lazy
and just go with an enumeration approach...

Nevertheless, it does introduce a subtle ordering for nominal data: if
Iris-Setosa = 1, Iris-Versicolor = 2, and Iris-Virginica = 3, then on
that scale Iris-Versicolor is intuitively closer to Virginica than
Setosa is, when in fact such distances don't mean anything on a
nominal scale. I hadn't thought about a function like that, but it
makes a lot of sense. Thanks!
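To make that concrete, a quick sketch using Peter's corrected
string_dist (the (x != y) version): enumeration invents an ordering,
while string_dist puts every pair of distinct labels at the same
distance.

def string_dist(x, y, weight=1):
    return weight * (x != y)

# enumeration: |2 - 3| = 1 but |1 - 3| = 2, so Iris-Versicolor looks
# "closer" to Iris-Virginica than Iris-Setosa does; with string_dist:
print string_dist("Iris-versicolor", "Iris-virginica")  # 1
print string_dist("Iris-setosa", "Iris-virginica")      # 1
print string_dist("Iris-setosa", "Iris-setosa")         # 0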
--
http://mail.python.org/mailman/listinfo/python-list


Re: in place list modification necessary? What's a better idiom?

2009-04-07 Thread Peter Otten
MooMaster wrote:

 Now we can't calculate a meaningful Euclidean distance for something
 like Iris-setosa and Iris-versicolor unless we use string-edit
 distance or something overly complicated, so instead we'll use a
 simple quantization scheme of enumerating the set of values within the
 column domain and replacing the strings with numbers (i.e. Iris-setosa
 = 1, iris-versicolor=2).

I'd calculate the distance as

def string_dist(x, y, weight=1):
    return weight * (x == y)

You don't get a high resolution in that dimension, but you don't introduce
an element of randomness, either.

Peter

--
http://mail.python.org/mailman/listinfo/python-list


Re: in place list modification necessary? What's a better idiom?

2009-04-07 Thread Peter Otten
Peter Otten wrote:

 MooMaster wrote:
 
 Now we can't calculate a meaningful Euclidean distance for something
 like Iris-setosa and Iris-versicolor unless we use string-edit
 distance or something overly complicated, so instead we'll use a
 simple quantization scheme of enumerating the set of values within the
 column domain and replacing the strings with numbers (i.e. Iris-setosa
 = 1, iris-versicolor=2).
 
 I'd calculate the distance as
 
 def string_dist(x, y, weight=1):
     return weight * (x == y)

oops, this must of course be (x != y).
 
 You don't get a high resolution in that dimension, but you don't introduce
 an element of randomness, either.
 
 Peter
 

--
http://mail.python.org/mailman/listinfo/python-list


Re: in place list modification necessary? What's a better idiom?

2009-04-07 Thread Carl Banks
On Apr 7, 12:38 am, Peter Otten __pete...@web.de wrote:
 MooMaster wrote:
  Now we can't calculate a meaningful Euclidean distance for something
  like Iris-setosa and Iris-versicolor unless we use string-edit
  distance or something overly complicated, so instead we'll use a
  simple quantization scheme of enumerating the set of values within the
  column domain and replacing the strings with numbers (i.e. Iris-setosa
  = 1, iris-versicolor=2).

 I'd calculate the distance as

 def string_dist(x, y, weight=1):
     return weight * (x == y)

 You don't get a high resolution in that dimension, but you don't introduce
 an element of randomness, either.

Does the algorithm require well-ordered data along the dimensions?
Though I've never heard of it, the fact that it's called bisecting
K-means suggests to me that it does, which means this wouldn't work.

However, the OP better be sure to set the scales for the quantized
dimensions high enough so that no clusters form containing points with
different discrete values.

That, in turn, suggests he might as well not even bother sending the
discrete values to the clustering algorithm, but instead call it
for each unique set of discretes.  (However, I could imagine the
marginal cost of more dimensions is less than that of multiple runs;
I've been dealing with such a case at work.)
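A sketch of that per-combination approach (hypothetical names;
cluster() stands in for whatever clustering call ends up being used):

from itertools import groupby

def cluster_per_discretes(rows, discrete_cols, cluster):
    # rows: lists of values; discrete_cols: indices of the nominal columns
    # cluster: any clustering callable, run once per discrete combination
    key = lambda row: tuple(row[i] for i in discrete_cols)
    results = {}
    for combo, group in groupby(sorted(rows, key=key), key):
        # drop the discrete columns; cluster only the numeric ones
        numeric = [[v for i, v in enumerate(row) if i not in discrete_cols]
                   for row in group]
        results[combo] = cluster(numeric)
    return results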

I'll leave it to the OP to decide.


Carl Banks
--
http://mail.python.org/mailman/listinfo/python-list


Re: in place list modification necessary? What's a better idiom?

2009-04-07 Thread Peter Otten
Carl Banks wrote:

 On Apr 7, 12:38 am, Peter Otten __pete...@web.de wrote:
 MooMaster wrote:
  Now we can't calculate a meaningful Euclidean distance for something
  like Iris-setosa and Iris-versicolor unless we use string-edit
  distance or something overly complicated, so instead we'll use a
  simple quantization scheme of enumerating the set of values within the
  column domain and replacing the strings with numbers (i.e. Iris-setosa
  = 1, iris-versicolor=2).

 I'd calculate the distance as

 def string_dist(x, y, weight=1):
     return weight * (x == y)

 You don't get a high resolution in that dimension, but you don't
 introduce an element of randomness, either.
 
 Does the algorithm require well-ordered data along the dimensions?
 Though I've never heard of it, the fact that it's called bisecting
 Kmeans suggests to me that it does, which means this wouldn't work.

I've read about K-Means in Segaran's "Collective Intelligence", which
describes it like this:

"K-Means clustering begins with k randomly placed centroids (points in space
that represent the center of the cluster), and assigns every item to the
nearest one. After the assignment, the centroids are moved to the average
location of all the nodes assigned to them, and the assignments are redone.
This process repeats until the assignments stop changing."

The book doesn't go into the theory, and "any distance would do" was my
assumption, which may be wrong.
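In code, the loop described there is roughly the following (a
bare-bones sketch, not from the book; dist is any distance function,
e.g. Euclidean):

import random

def kmeans(points, k, dist):
    # points: a list of equal-length lists of floats
    centroids = random.sample(points, k)
    assignment = None
    while True:
        # assign every point to the nearest centroid
        new = [min(range(k), key=lambda i: dist(p, centroids[i]))
               for p in points]
        if new == assignment:  # assignments stopped changing
            return centroids, assignment
        assignment = new
        # move each centroid to the average location of its members
        for i in range(k):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:
                centroids[i] = [sum(d) / float(len(d))
                                for d in zip(*members)]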

Peter
--
http://mail.python.org/mailman/listinfo/python-list


Re: in place list modification necessary? What's a better idiom?

2009-04-07 Thread andrew cooke
Carl Banks wrote:
 import collections
 import itertools

 def createInitialCluster(fileName):
     fixedPoints = []
     # quantization is a dict that assigns sequentially-increasing numbers
     # to values when reading keys that don't yet exist
     quantization = collections.defaultdict(itertools.count().next)
     with open(fileName, 'r') as f:
         for line in f:
             dimensions = []
             for s in line.rstrip('\n').split(','):
                 if isNumeric(s):
                     dimensions.append(float(s))
                 else:
                     dimensions.append(float(quantization[s]))
             fixedPoints.append(Point(dimensions))
     return Cluster(fixedPoints)

nice reply (i didn't know defaultdict worked like that - very neat).

two small things i noticed:

1 - do you need a separate quantization for each column?  the code above
might give, for example, non-contiguous ranges of integers for a
particular column if a string occurs (by accident perhaps) in more than
one.

2 - don't bother with isNumeric.  just return the cast value or catch the
exception:

   [...]
   try:
       dimensions.append(float(s))
   except:
       dimensions.append(float(quantization[s]))

(not sure float() is needed there either if you're using a recent version
of python - only reason i can think of is to avoid integer division in
older versions).
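(for the record, a sketch of the older-Python pitfall meant here:
quantization[s] yields ints from count(), and in Python 2 int / int
truncates:

q = 1                # an int from the quantization dict
print q / 2          # 0 in Python 2 (integer division)
print float(q) / 2   # 0.5, hence the float() wrapper
)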

andrew


--
http://mail.python.org/mailman/listinfo/python-list


Re: in place list modification necessary? What's a better idiom?

2009-04-07 Thread R. David Murray
andrew cooke and...@acooke.org wrote:
 Carl Banks wrote:
  import collections
  import itertools
 
  def createInitialCluster(fileName):
      fixedPoints = []
      # quantization is a dict that assigns sequentially-increasing numbers
      # to values when reading keys that don't yet exist
      quantization = collections.defaultdict(itertools.count().next)
      with open(fileName, 'r') as f:
          for line in f:
              dimensions = []
              for s in line.rstrip('\n').split(','):
                  if isNumeric(s):
                      dimensions.append(float(s))
                  else:
                      dimensions.append(float(quantization[s]))
              fixedPoints.append(Point(dimensions))
      return Cluster(fixedPoints)
 
 nice reply (i didn't know defaultdict worked like that - very neat).
 
 two small things i noticed:
 
 1 - do you need a separate quantization for each column?  the code above
 might give, for example, non-contiguous ranges of integers for a
 particular column if a string occurs (by accident perhaps) in more than
 one.
 
 2 - don't bother with isNumeric.  just return the cast value or catch the
 exception:
 
    [...]
    try:
        dimensions.append(float(s))
    except:
        dimensions.append(float(quantization[s]))

No, no, no; never use a bare except! :)

Do it MRAB's way and catch ValueError.

--
R. David Murray http://www.bitdance.com

--
http://mail.python.org/mailman/listinfo/python-list


Re: in place list modification necessary? What's a better idiom?

2009-04-07 Thread MRAB

Carl Banks wrote:

MooMaster wrote:

So I'm reading in values from a file, and for each column I need to
dynamically discover the range of possible values it can take and
quantize if necessary. This is the solution I've come up with:


[snip]

#harvested from http://www.rosettacode.org/wiki/IsNumeric#Python
def isNumeric(string):
    try:
        i = float(string)
    except ValueError:
        return False
    return True



[snip]


import collections
import itertools

def createInitialCluster(fileName):
    fixedPoints = []
    # quantization is a dict that assigns sequentially-increasing numbers
    # to values when reading keys that don't yet exist
    quantization = collections.defaultdict(itertools.count().next)
    with open(fileName, 'r') as f:
        for line in f:
            dimensions = []
            for s in line.rstrip('\n').split(','):
                if isNumeric(s):
                    dimensions.append(float(s))
                else:
                    dimensions.append(float(quantization[s]))
            fixedPoints.append(Point(dimensions))
    return Cluster(fixedPoints)


I'd also remove the isNumeric() function. You're just converting to a
float if you can successfully convert to a float!

try:
    dimensions.append(float(s))
except ValueError:
    dimensions.append(float(quantization[s]))
--
http://mail.python.org/mailman/listinfo/python-list


Re: in place list modification necessary? What's a better idiom?

2009-04-07 Thread andrew cooke

just playing around - doesn't work with 3.0 due to lack of pattern binding
(which i think is coming back in 3.1?)

>>> from collections import defaultdict
>>> from itertools import count
>>> def mkdict(): return defaultdict(count().next)
...
>>> converters = defaultdict(mkdict)
>>> def to_float((i, s)):
...   try: return float(s)
...   except: return converters[i][s]
...
>>> def convert(line):
...   return map(to_float, zip(count(), line.split(',')))
...
>>> convert('1,2,red')
[1.0, 2.0, 0]
>>> convert('1,2,red')
[1.0, 2.0, 0]
>>> convert('1,2,blue')
[1.0, 2.0, 1]
>>> convert('1,2,blue,blue')
[1.0, 2.0, 1, 0]
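(note the last result: 'blue' maps to 1 in the third column but to 0 in
the fourth - each column gets its own counter via converters[i], which
addresses point 1 from the earlier message.)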


andrew cooke wrote:
 Carl Banks wrote:
 import collections
 import itertools

 def createInitialCluster(fileName):
     fixedPoints = []
     # quantization is a dict that assigns sequentially-increasing numbers
     # to values when reading keys that don't yet exist
     quantization = collections.defaultdict(itertools.count().next)
     with open(fileName, 'r') as f:
         for line in f:
             dimensions = []
             for s in line.rstrip('\n').split(','):
                 if isNumeric(s):
                     dimensions.append(float(s))
                 else:
                     dimensions.append(float(quantization[s]))
             fixedPoints.append(Point(dimensions))
     return Cluster(fixedPoints)

 nice reply (i didn't know defaultdict worked like that - very neat).

 two small things i noticed:

 1 - do you need a separate quantization for each column?  the code above
 might give, for example, non-contiguous ranges of integers for a
 particular column if a string occurs (by accident perhaps) in more than
 one.

 2 - don't bother with isNumeric.  just return the cast value or catch the
 exception:

   [...]
   try:
       dimensions.append(float(s))
   except:
       dimensions.append(float(quantization[s]))

 (not sure float() is needed there either if you're using a recent version
 of python - only reason i can think of is to avoid integer division in
 older versions).

 andrew


 --
 http://mail.python.org/mailman/listinfo/python-list




--
http://mail.python.org/mailman/listinfo/python-list


Re: in place list modification necessary? What's a better idiom?

2009-04-07 Thread andrew cooke
R. David Murray wrote:
   [...]
   try:
       dimensions.append(float(s))
   except:
       dimensions.append(float(quantization[s]))

 No, no, no; never use a bare except! :)

can you explain why?  i can't think of any reason why the code would be
better catching a specific exception.

as a general rule, maybe, but in this particular case i can't see a reason
- so i'm not sure if you're just pedantically following rules or if i've
missed something i should know.

andrew

--
http://mail.python.org/mailman/listinfo/python-list


Re: in place list modification necessary? What's a better idiom?

2009-04-07 Thread MRAB

andrew cooke wrote:

R. David Murray wrote:

  [...]
  try:
      dimensions.append(float(s))
  except:
      dimensions.append(float(quantization[s]))

No, no, no; never use a bare except! :)


can you explain why?  i can't think of any reason why the code would be
better catching a specific exception.

as a general rule, maybe, but in this particular case i can't see a reason
- so i'm not sure if you're just pedantically following rules or if i've
missed something i should know.


A bare except clause will catch _any_ exception, even those you didn't
expect (technical term: bug :-)); for example, if you had written:

try:
    dimension.append(float(s))
except:
    dimensions.append(float(quantization[s]))

it would silently catch "NameError: name 'dimension' is not defined" and
perform the fallback action.
--
http://mail.python.org/mailman/listinfo/python-list


Re: in place list modification necessary? What's a better idiom?

2009-04-07 Thread R. David Murray

On Tue, 7 Apr 2009 at 09:01, andrew cooke wrote:

R. David Murray wrote:

  [...]
  try:
      dimensions.append(float(s))
  except:
      dimensions.append(float(quantization[s]))


No, no, no; never use a bare except! :)


can you explain why?  i can't think of any reason why the code would be
better catching a specific exception.

as a general rule, maybe, but in this particular case i can't see a reason
- so i'm not sure if you're just pedantically following rules or if i've
missed something i should know.


What if the user pressed ctl-c right when the float was being converted
and appended?

Never use a bare except unless you have a specific reason to do so,
and there are very few of those.  (Yes, I should have said it that way
to start with; my apologies for going hyperbolic.)  Using one because
it doesn't _look_ like it will cause issues is just asking for
hard-to-track-down bugs :).
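A toy illustration of that failure mode (a Python 2 sketch; interrupt
it during the sleep):

import time

try:
    time.sleep(10)       # pretend this is the conversion; hit ctl-c now
    value = float("3.14")
except:                  # bad: silently swallows the KeyboardInterrupt
    value = 0.0

try:
    value = float("oops")
except ValueError:       # good: catches only the expected failure
    value = 0.0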

--
R. David Murray http://www.bitdance.com
--
http://mail.python.org/mailman/listinfo/python-list


Re: in place list modification necessary? What's a better idiom?

2009-04-07 Thread andrew cooke
MRAB wrote:
 andrew cooke wrote:
 R. David Murray wrote:
   [...]
   try:
       dimensions.append(float(s))
   except:
       dimensions.append(float(quantization[s]))
 No, no, no; never use a bare except! :)

 can you explain why?  i can't think of any reason why the code would be
 better catching a specific exception.

 as a general rule, maybe, but in this particular case i can't see a
 reason
 - so i'm not sure if you're just pedantically following rules or if i've
 missed something i should know.

 A bare exception will catch _any_ exception, even those you didn't
 expect (technical term: bug :-)); for example, if you had written:

  try:
      dimension.append(float(s))
  except:
      dimensions.append(float(quantization[s]))

 it would silently catch NameError: name 'dimension' is not defined and
 perform the fallback action.

true, i hadn't considered coding errors (i was thinking that something
global, like out of memory, would fail again anyway).  andrew

--
http://mail.python.org/mailman/listinfo/python-list


Re: in place list modification necessary? What's a better idiom?

2009-04-06 Thread Carl Banks
MooMaster wrote:
 So I'm reading in values from a file, and for each column I need to
 dynamically discover the range of possible values it can take and
 quantize if necessary. This is the solution I've come up with:

 code
 def createInitialCluster(fileName):
     #get the data from the file
     points = []
     with open(fileName, 'r') as f:
         for line in f:
             points.append(line.rstrip('\n'))
     #clean up the data
     fixedPoints = []
     for point in points:
         dimensions = [quantize(i, points, point.split(',').index(i))
                       for i in point.split(',')]
         print dimensions
         fixedPoints.append(Point(dimensions))
     #return an initial cluster of all the points
     return Cluster(fixedPoints)

 def quantize(stringToQuantize, pointList, columnIndex):
     #if it's numeric, no need to quantize
     if(isNumeric(stringToQuantize)):
         return float(stringToQuantize)
     #first we need to discover all the possible values of this column
     domain = []
     for point in pointList:
         domain.append(point.split(',')[columnIndex])
     #make it a set to remove duplicates
     domain = list(Set(domain))
     #use the index into the domain as the number representing this value
     return float(domain.index(stringToQuantize))

 #harvested from http://www.rosettacode.org/wiki/IsNumeric#Python
 def isNumeric(string):
     try:
         i = float(string)
     except ValueError:
         return False
     return True

Big problem with this.  I'm guessing you never ran it on a really big
file yet.  Your quantize function does a lot of unnecessary work: it
rebuilds the list of indices for every single line, every single non-
numeric entry, in fact.  (Tech-speak: that is N^2 behavior in both the
number of lines and number of non-numeric columns.  Not good.)  This
will work ok for a small file, but will take forever on a large file
(perhaps literally).

So, forgetting about column indexing for a moment, we can improve this
vastly simply by generating the list of indices once.  Do that in a
separate function, have it return the list, and then pass that list to
the quantize function.  So replace the midportion of your
createInitialCluster function with something like this:


column_indices = []
for i in xrange(len(points[0].split(','))): # number of columns
    column_indices.append(quantize_columns(points, i))
fixedPoints = []
for point in points:
    dimensions = [quantize(i, column_indices[i], s)
                  for (i, s) in enumerate(point.split(','))] # need index as well as entry here
    print dimensions
    fixedPoints.append(Point(dimensions))


And the two functions would be something like this:

def quantize_columns(point_list, column_index):
    # this assumes if the first entry in a column is numeric, the whole
    # column would be
    if isNumeric(point_list[0].split(',')[column_index]):
        return None # don't quantize this column
    #first we need to discover all the possible values of this column
    domain = []
    for point in point_list:
        domain.append(point.split(',')[column_index])
    #make it a set to remove duplicates
    return list(set(domain))

def quantize(i, domain, s):
    if domain is None:
        return float(s)
    return float(domain.index(s))

This (once debugged :) will run much, much faster on a large dataset.
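For instance, on a toy three-column dataset the pair of functions
behaves like this (a sketch; the number assigned to each species
depends on set ordering):

lines = ["5.1,3.5,Iris-setosa",
         "4.9,3.0,Iris-versicolor",
         "6.3,2.9,Iris-virginica"]

column_indices = [quantize_columns(lines, i)
                  for i in range(len(lines[0].split(',')))]
# column_indices == [None, None, the three species names in set order]

for line in lines:
    print [quantize(i, column_indices[i], s)
           for i, s in enumerate(line.split(','))]
# prints e.g. [5.1, 3.5, 1.0]; numeric columns pass through,
# the species column becomes its index into the domain list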


Now back to your question.

 It works, but it feels a little ugly, and not exactly Pythonic. Using
 two lists I need the original point list to read in the data, then the
 dimensions one to hold the processed point, and a fixedPoint list to
 make objects out of the processed data. If my dataset is in the order
 of millions, this'll nuke the memory. I tried something like:

 for point in points:
    point = Point([quantize(i, points, point.split(',').index(i))
                   for i in point.split(',')])

 but when I print out the points afterward, it doesn't keep the
 changes.

It's because the order of items in a set is undefined.  The order of
the items in list(set(["a","b","c","d"])) might be very different from
the order in list(set(["a","b","c","d","e"])).  You were passing
quantize an incomplete list of points, so as the point list
grew, the items in the set changed, and it messed up the order.  In
fact, you should never rely on the order being the same, even if you
created the set with the very same arguments.

What you are trying to do should be done with dictionaries: create a
dict that maps a value to a number.

Now, based on your quantize function, it would seem that the number
associated with the value is arbitrary (doesn't matter what it is, as
long as it's distinct), so it isn't necessary to read the whole csv
file in before assigning numbers; just build the dict as you go.

I suggest collections.defaultdict for this task.  It's like a regular
dict, but it creates a new value any time you access a key that
doesn't exist, a perfect solution to your task.  We'll pass it a
function that generates a different index each time it's called.
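
import collections
import itertools

def createInitialCluster(fileName):
    fixedPoints = []
    # quantization is a dict that assigns sequentially-increasing numbers
    # to values when reading keys that don't yet exist
    quantization = collections.defaultdict(itertools.count().next)
    with open(fileName, 'r') as f:
        for line in f:
            dimensions = []
            for s in line.rstrip('\n').split(','):
                if isNumeric(s):
                    dimensions.append(float(s))
                else:
                    dimensions.append(float(quantization[s]))
            fixedPoints.append(Point(dimensions))
    return Cluster(fixedPoints)

--
http://mail.python.org/mailman/listinfo/python-list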