[jira] [Comment Edited] (SPARK-3219) K-Means clusterer should support Bregman distance functions

2015-04-07 Thread Sai Nishanth Parepally (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482297#comment-14482297
 ] 

Sai Nishanth Parepally edited comment on SPARK-3219 at 4/7/15 6:10 PM:
---

[~mengxr], is https://github.com/derrickburns/generalized-kmeans-clustering 
going to be merged into MLlib? I would like to use Jaccard distance as a 
distance metric for k-means clustering. Should I add this distance metric to 
derrickburns's repository, or should I instead make the current MLlib 
implementation of k-means accept a method that computes the distance between 
any two points?
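
A minimal sketch of what a Jaccard distance could look like as a pluggable 
distance function, assuming points are represented as sets of active feature 
indices; every name below is illustrative only, and nothing here comes from 
MLlib or from derrickburns's repository:

{code}
// Hypothetical sketch: Jaccard distance written as a plain function, so it
// could be handed to any k-means variant that accepts a pluggable
// (point, point) => Double distance.
object JaccardDistanceSketch {
  // Assumption for this sketch: a point is the set of its active feature indices.
  def jaccard(a: Set[Int], b: Set[Int]): Double =
    if (a.isEmpty && b.isEmpty) 0.0
    else {
      val intersection = (a intersect b).size.toDouble
      val union = (a union b).size.toDouble
      1.0 - intersection / union  // distance = 1 - Jaccard similarity
    }

  def main(args: Array[String]): Unit = {
    println(jaccard(Set(1, 2, 3), Set(2, 3, 4)))  // 0.5
  }
}
{code}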


was (Author: nishanthps):
[~mengxr], is https://github.com/derrickburns/generalized-kmeans-clustering 
going to be merged into MLlib? I would like to use Jaccard distance as a 
distance metric for k-means clustering.

 K-Means clusterer should support Bregman distance functions
 ---

 Key: SPARK-3219
 URL: https://issues.apache.org/jira/browse/SPARK-3219
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Derrick Burns
Assignee: Derrick Burns
  Labels: clustering

 The K-Means clusterer supports the Euclidean distance metric.  However, it is 
 rather straightforward to support Bregman 
 (http://machinelearning.wustl.edu/mlpapers/paper_files/BanerjeeMDG05.pdf) 
 distance functions, which would increase the utility of the clusterer 
 tremendously.
 I have modified the clusterer to support pluggable distance functions.  
 However, I notice that there are hundreds of outstanding pull requests.  If 
 someone is willing to work with me to sponsor the work through the process, I 
 will create a pull request.  Otherwise, I will just keep my own fork.
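
As a hedged illustration of the Bregman family referenced above: a Bregman 
divergence is generated by a convex function phi via 
D_phi(p, q) = phi(p) - phi(q) - <grad phi(q), p - q>, and choosing 
phi(x) = ||x||^2 recovers squared Euclidean distance.  The sketch below is 
self-contained; its names (BregmanSketch, bregman) are assumptions for this 
illustration, not code from the proposed patch.

{code}
// Sketch only: a generic Bregman divergence parameterized by phi and its gradient.
object BregmanSketch {
  type Vec = Array[Double]

  def dot(a: Vec, b: Vec): Double = a.zip(b).map { case (x, y) => x * y }.sum

  def bregman(phi: Vec => Double, gradPhi: Vec => Vec)(p: Vec, q: Vec): Double =
    phi(p) - phi(q) - dot(gradPhi(q), p.zip(q).map { case (x, y) => x - y })

  // phi(x) = ||x||^2 generates the squared Euclidean distance.
  val squaredNorm: Vec => Double = x => dot(x, x)
  val gradSquaredNorm: Vec => Vec = x => x.map(_ * 2.0)

  def main(args: Array[String]): Unit = {
    val d = bregman(squaredNorm, gradSquaredNorm)(Array(1.0, 2.0), Array(3.0, 1.0))
    println(d)  // 5.0 == ||p - q||^2
  }
}
{code}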



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3219) K-Means clusterer should support Bregman distance functions

2014-09-15 Thread Derrick Burns (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14134339#comment-14134339
 ] 

Derrick Burns edited comment on SPARK-3219 at 9/15/14 7:14 PM:
---

The key abstractions that need to be added to the K-Means implementation to 
support interesting distance functions are: Point (P), Center (C), and 
Centroid.  Then, one can implement a distance-function trait (called PointOps 
below) in a way that allows the implementer to pre-compute values for Point 
and Center, as is hard-coded for the fast squared Euclidean distance function 
in the 1.0.2 K-Means implementation.  Since the representation of Point and 
Center is abstracted, the implementer of the trait can use JBlas, Breeze, or 
whatever math library is preferred, again without touching the generic 
K-Means implementation.

{code}
  trait PointOps[P <: FP[T], C <: FP[T], T] {
def distance(p: P, c: C, upperBound: Distance): Distance

def userToPoint(v: Array[Double], index: Option[T]): P

def centerToPoint(v: C): P

def pointToCenter(v: P): C

def centroidToCenter(v: Centroid): C

def centroidToPoint(v: Centroid): P

def centerMoved(v: P, w: C): Boolean

  }
{code}
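
To make the pre-computation idea concrete, here is a minimal, self-contained 
sketch; the names (Pt, toPt, fastSquaredDistance) are assumptions for this 
illustration rather than the actual PointOps implementation, and the 
upperBound short-circuit of the distance method above is omitted:

{code}
// Sketch only: point and center both carry a squared norm computed once, so
// the squared Euclidean distance reduces to ||p||^2 + ||c||^2 - 2*(p . c),
// i.e. a single dot product per comparison.
object PrecomputedNormSketch {
  final case class Pt(raw: Array[Double], norm2: Double)

  def toPt(v: Array[Double]): Pt = Pt(v, v.map(x => x * x).sum)

  def fastSquaredDistance(p: Pt, c: Pt): Double = {
    var dot = 0.0
    var i = 0
    while (i < p.raw.length) { dot += p.raw(i) * c.raw(i); i += 1 }
    math.max(0.0, p.norm2 + c.norm2 - 2.0 * dot)  // clamp tiny negative round-off
  }

  def main(args: Array[String]): Unit = {
    println(fastSquaredDistance(toPt(Array(1.0, 2.0, 3.0)), toPt(Array(1.0, 0.0, 3.0))))  // 4.0
  }
}
{code}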


was (Author: derrickburns):
The key abstractions that need to be added to the K-Means implementation to 
support interesting distance functions are: Point (P), Center (C), and 
Centroid.  Then, one can implement a distance-function trait (called PointOps 
below) in a way that allows the implementer to pre-compute values for Point 
and Center, as is hard-coded for the fast squared Euclidean distance function 
in the 1.0.2 K-Means implementation.  Since the representation of Point and 
Center is abstracted, the implementer of the trait can use JBlas, Breeze, or 
whatever math library is preferred, again without touching the generic 
K-Means implementation.

  trait PointOps[P <: FP[T], C <: FP[T], T] {
def distance(p: P, c: C, upperBound: Distance): Distance

def userToPoint(v: Array[Double], index: Option[T]): P

def centerToPoint(v: C): P

def pointToCenter(v: P): C

def centroidToCenter(v: Centroid): C

def centroidToPoint(v: Centroid): P

def centerMoved(v: P, w: C): Boolean

  }




[jira] [Comment Edited] (SPARK-3219) K-Means clusterer should support Bregman distance functions

2014-09-15 Thread Derrick Burns (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14134339#comment-14134339
 ] 

Derrick Burns edited comment on SPARK-3219 at 9/15/14 7:16 PM:
---

The key abstractions that need to be added to the K-Means implementation to 
support interesting distance functions are: Point (P), Center (C), and 
Centroid.  Then, one can implement a distance-function trait (called PointOps 
below) in a way that allows the implementer to pre-compute values for Point 
and Center, as is hard-coded for the fast squared Euclidean distance function 
in the 1.0.2 K-Means implementation.  Since the representation of Point and 
Center is abstracted, the implementer of the trait can use JBlas, Breeze, or 
whatever math library is preferred, again without touching the generic 
K-Means implementation.

Additionally, one can abstract the Distance (Float or Double) and the user data 
point T. 

{code}

  type Distance = Double

  trait FP[T] extends Serializable {
val weight: Distance
val index: Option[T]
val raw : Array[Distance]
  }
  trait PointOps[P <: FP[T], C <: FP[T], T] {
def distance(p: P, c: C, upperBound: Distance): Distance

def userToPoint(v: Array[Double], index: Option[T]): P

def centerToPoint(v: C): P

def pointToCenter(v: P): C

def centroidToCenter(v: Centroid): C

def centroidToPoint(v: Centroid): P

def centerMoved(v: P, w: C): Boolean

  }
{code}


was (Author: derrickburns):
The key abstractions that need to be added to the K-Means implementation to 
support interesting distance functions are: Point (P), Center (C), and 
Centroid.  Then, one can implement a distance-function trait (called PointOps 
below) in a way that allows the implementer to pre-compute values for Point 
and Center, as is hard-coded for the fast squared Euclidean distance function 
in the 1.0.2 K-Means implementation.  Since the representation of Point and 
Center is abstracted, the implementer of the trait can use JBlas, Breeze, or 
whatever math library is preferred, again without touching the generic 
K-Means implementation.

Additionally, one can abstract the Distance (Float or Double) and the user data 
point T. 

{code}

type Distance = Double

  trait FP[T] extends Serializable {
val weight: Distance
val index: Option[T]
val raw : Array[Distance]
  }
  trait PointOps[P <: FP[T], C <: FP[T], T] {
def distance(p: P, c: C, upperBound: Distance): Distance

def userToPoint(v: Array[Double], index: Option[T]): P

def centerToPoint(v: C): P

def pointToCenter(v: P): C

def centroidToCenter(v: Centroid): C

def centroidToPoint(v: Centroid): P

def centerMoved(v: P, w: C): Boolean

  }
{code}




[jira] [Comment Edited] (SPARK-3219) K-Means clusterer should support Bregman distance functions

2014-09-15 Thread Derrick Burns (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14134339#comment-14134339
 ] 

Derrick Burns edited comment on SPARK-3219 at 9/15/14 7:16 PM:
---

The key abstractions that need to be added to the K-Means implementation to 
support interesting distance functions are: Point (P), Center (C), and 
Centroid.  Then, one can implement a distance-function trait (called PointOps 
below) in a way that allows the implementer to pre-compute values for Point 
and Center, as is hard-coded for the fast squared Euclidean distance function 
in the 1.0.2 K-Means implementation.  Since the representation of Point and 
Center is abstracted, the implementer of the trait can use JBlas, Breeze, or 
whatever math library is preferred, again without touching the generic 
K-Means implementation.

Additionally, one can abstract the Distance (Float or Double) and the user data 
point T. 

{code}

type Distance = Double

  trait FP[T] extends Serializable {
val weight: Distance
val index: Option[T]
val raw : Array[Distance]
  }
  trait PointOps[P <: FP[T], C <: FP[T], T] {
def distance(p: P, c: C, upperBound: Distance): Distance

def userToPoint(v: Array[Double], index: Option[T]): P

def centerToPoint(v: C): P

def pointToCenter(v: P): C

def centroidToCenter(v: Centroid): C

def centroidToPoint(v: Centroid): P

def centerMoved(v: P, w: C): Boolean

  }
{code}


was (Author: derrickburns):
The key abstractions that need to be added to the K-Means implementation to 
support interesting distance functions are: Point (P), Center (C), and 
Centroid.  Then, one can implement a distance-function trait (called PointOps 
below) in a way that allows the implementer to pre-compute values for Point 
and Center, as is hard-coded for the fast squared Euclidean distance function 
in the 1.0.2 K-Means implementation.  Since the representation of Point and 
Center is abstracted, the implementer of the trait can use JBlas, Breeze, or 
whatever math library is preferred, again without touching the generic 
K-Means implementation.

Additionally, one can abstract the Distance (Float or Double) and the user data 
point T. 

{code}
  trait PointOps[P <: FP[T], C <: FP[T], T] {
def distance(p: P, c: C, upperBound: Distance): Distance

def userToPoint(v: Array[Double], index: Option[T]): P

def centerToPoint(v: C): P

def pointToCenter(v: P): C

def centroidToCenter(v: Centroid): C

def centroidToPoint(v: Centroid): P

def centerMoved(v: P, w: C): Boolean

  }
{code}




[jira] [Comment Edited] (SPARK-3219) K-Means clusterer should support Bregman distance functions

2014-09-15 Thread Derrick Burns (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14134339#comment-14134339
 ] 

Derrick Burns edited comment on SPARK-3219 at 9/15/14 7:15 PM:
---

The key abstractions that need to be added to the K-Means implementation to 
support interesting distance functions are: Point (P), Center (C), and 
Centroid.  Then, one can implement a distance-function trait (called PointOps 
below) in a way that allows the implementer to pre-compute values for Point 
and Center, as is hard-coded for the fast squared Euclidean distance function 
in the 1.0.2 K-Means implementation.  Since the representation of Point and 
Center is abstracted, the implementer of the trait can use JBlas, Breeze, or 
whatever math library is preferred, again without touching the generic 
K-Means implementation.

Additionally, one can abstract the Distance (Float or Double) and the user data 
point T. 

{code}
  trait PointOps[P <: FP[T], C <: FP[T], T] {
def distance(p: P, c: C, upperBound: Distance): Distance

def userToPoint(v: Array[Double], index: Option[T]): P

def centerToPoint(v: C): P

def pointToCenter(v: P): C

def centroidToCenter(v: Centroid): C

def centroidToPoint(v: Centroid): P

def centerMoved(v: P, w: C): Boolean

  }
{code}


was (Author: derrickburns):
The key abstractions that need to be added to the K-Means implementation to 
support interesting distance functions are: Point (P), Center (C), and 
Centroid.  Then, one can implement a distance-function trait (called PointOps 
below) in a way that allows the implementer to pre-compute values for Point 
and Center, as is hard-coded for the fast squared Euclidean distance function 
in the 1.0.2 K-Means implementation.  Since the representation of Point and 
Center is abstracted, the implementer of the trait can use JBlas, Breeze, or 
whatever math library is preferred, again without touching the generic 
K-Means implementation.

{code}
  trait PointOps[P <: FP[T], C <: FP[T], T] {
def distance(p: P, c: C, upperBound: Distance): Distance

def userToPoint(v: Array[Double], index: Option[T]): P

def centerToPoint(v: C): P

def pointToCenter(v: P): C

def centroidToCenter(v: Centroid): C

def centroidToPoint(v: Centroid): P

def centerMoved(v: P, w: C): Boolean

  }
{code}




[jira] [Comment Edited] (SPARK-3219) K-Means clusterer should support Bregman distance functions

2014-09-15 Thread Derrick Burns (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14134339#comment-14134339
 ] 

Derrick Burns edited comment on SPARK-3219 at 9/15/14 7:23 PM:
---

The key abstractions that need to be added to the K-Means implementation to 
support interesting distance functions are: Distance (e.g. Float or Double), T 
(the input data type of a point), P (a point as represented by the distance 
function), C (a cluster center as represented by the distance function), and 
Centroid.  

By separating the user type T from the types P (point) and C (center), one 
can do things like pre-compute values, as is done with the Fast Euclidean 
distance, which pre-computes magnitudes.  (With more complex distance 
functions, such as the Kullback-Leibler function, one can pre-compute the 
logs of the points.)

Further, since the representation of point and center is abstracted, the 
implementer of the trait can use JBlas, Breeze, or whatever math library is 
preferred, again, without touching the generic K-Means implementation. 

{code}

  type Distance = Double

  trait FP[T] extends Serializable {
val weight: Distance
val index: Option[T]
val raw : Array[Distance]
  }
  trait PointOps[P <: FP[T], C <: FP[T], T] {
def distance(p: P, c: C, upperBound: Distance): Distance

def userToPoint(v: Array[Double], index: Option[T]): P

def centerToPoint(v: C): P

def pointToCenter(v: P): C

def centroidToCenter(v: Centroid): C

def centroidToPoint(v: Centroid): P

def centerMoved(v: P, w: C): Boolean

  }
{code}
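
A hedged sketch of the Kullback-Leibler case mentioned above, showing where 
the pre-computed logs would live; the names (KLPoint, toKLPoint, kl) are 
illustrative only, and the code assumes each point is a strictly positive 
probability vector:

{code}
// Sketch only: the log of each coordinate is computed once, when the point is
// constructed, instead of on every distance evaluation.
object KLSketch {
  final case class KLPoint(raw: Array[Double], logs: Array[Double])

  def toKLPoint(p: Array[Double]): KLPoint = KLPoint(p, p.map(math.log))

  // KL(p || c) = sum_i p_i * (log p_i - log c_i); the cached logs are reused
  // across every center a point is compared with.
  def kl(p: KLPoint, c: KLPoint): Double = {
    var d = 0.0
    var i = 0
    while (i < p.raw.length) {
      d += p.raw(i) * (p.logs(i) - c.logs(i))
      i += 1
    }
    d
  }

  def main(args: Array[String]): Unit = {
    println(kl(toKLPoint(Array(0.5, 0.5)), toKLPoint(Array(0.9, 0.1))))  // ~0.511
  }
}
{code}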


was (Author: derrickburns):
The key abstractions that need to be added to the K-Means implementation to 
support interesting distance functions are: Distance (e.g. Float or Double), T 
(the input data type of a point), P (a point as represented by the distance 
function), C (a cluster center as represented by the distance function), and 
Centroid.  

By separating the user type T from the types P (point) and C (center), one 
can do things like pre-compute values, as is done with the Fast Euclidean 
distance, which pre-computes magnitudes.  (With more complex distance 
functions, such as the Kullback-Leibler function, one can pre-compute the 
logs of the points.)

Further, since the representation of point and center is abstracted, the 
implementer of the trait can use JBlas, Breeze, or whatever math library is 
preferred, again, without touching the generic K-Means implementation. 

{code}

  type Distance = Double

  trait FP[T] extends Serializable {
val weight: Distance
val index: Option[T]
val raw : Array[Distance]
  }
  trait PointOps[P <: FP[T], C <: FP[T], T] {
def distance(p: P, c: C, upperBound: Distance): Distance

def userToPoint(v: Array[Double], index: Option[T]): P

def centerToPoint(v: C): P

def pointToCenter(v: P): C

def centroidToCenter(v: Centroid): C

def centroidToPoint(v: Centroid): P

def centerMoved(v: P, w: C): Boolean

  }
{code}




[jira] [Comment Edited] (SPARK-3219) K-Means clusterer should support Bregman distance functions

2014-09-15 Thread Derrick Burns (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14134339#comment-14134339
 ] 

Derrick Burns edited comment on SPARK-3219 at 9/15/14 7:23 PM:
---

The key abstractions that need to be added to the K-Means implementation to 
support interesting distance functions are: Distance (e.g. Float or Double), T 
(the input data type of a point), P (a point as represented by the distance 
function), C (a cluster center as represented by the distance function), and 
Centroid.  

By separating the user type T from the types P (point) and C (center), one 
can do things like pre-compute values, as is done with the Fast Euclidean 
distance, which pre-computes magnitudes.  (With more complex distance 
functions, such as the Kullback-Leibler function, one can pre-compute the 
logs of the points.)

Further, since the representation of point and center is abstracted, the 
implementer of the trait can use JBlas, Breeze, or whatever math library is 
preferred, again, without touching the generic K-Means implementation. 

{code}

  type Distance = Double

  trait FP[T] extends Serializable {
val weight: Distance
val index: Option[T]
val raw : Array[Distance]
  }
  trait PointOps[P <: FP[T], C <: FP[T], T] {
def distance(p: P, c: C, upperBound: Distance): Distance

def userToPoint(v: Array[Double], index: Option[T]): P

def centerToPoint(v: C): P

def pointToCenter(v: P): C

def centroidToCenter(v: Centroid): C

def centroidToPoint(v: Centroid): P

def centerMoved(v: P, w: C): Boolean

  }
{code}


was (Author: derrickburns):
The key abstractions that need to be added to the K-Means implementation to 
support interesting distance functions are: Point (P), Center (C), and 
Centroid.  Then, one can implement a distance-function trait (called PointOps 
below) in a way that allows the implementer to pre-compute values for Point 
and Center, as is hard-coded for the fast squared Euclidean distance function 
in the 1.0.2 K-Means implementation.  Since the representation of Point and 
Center is abstracted, the implementer of the trait can use JBlas, Breeze, or 
whatever math library is preferred, again without touching the generic 
K-Means implementation.

Additionally, one can abstract the Distance (Float or Double) and the user data 
point T. 

{code}

  type Distance = Double

  trait FP[T] extends Serializable {
val weight: Distance
val index: Option[T]
val raw : Array[Distance]
  }
  trait PointOps[P <: FP[T], C <: FP[T], T] {
def distance(p: P, c: C, upperBound: Distance): Distance

def userToPoint(v: Array[Double], index: Option[T]): P

def centerToPoint(v: C): P

def pointToCenter(v: P): C

def centroidToCenter(v: Centroid): C

def centroidToPoint(v: Centroid): P

def centerMoved(v: P, w: C): Boolean

  }
{code}




[jira] [Comment Edited] (SPARK-3219) K-Means clusterer should support Bregman distance functions

2014-09-15 Thread Derrick Burns (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14134339#comment-14134339
 ] 

Derrick Burns edited comment on SPARK-3219 at 9/15/14 7:23 PM:
---

The key abstractions that need to be added to the K-Means implementation to 
support interesting distance functions are: Distance (e.g. Float or Double), T 
(the input data type of a point), P (a point as represented by the distance 
function), C (a cluster center as represented by the distance function), and 
Centroid.  

By separating the user type T from the types P (point) and C (center), one can 
do things like pre-compute values (as is done with the Fast Euclidean distance 
in the 1.0.2 implementation that pre-computes magnitudes).  (With more complex 
distance functions such as the Kullback-Leibler function, one can pre-compute 
the log of the points.)

Further, since the representation of point and center is abstracted, the 
implementer of the trait can use JBlas, Breeze, or whatever math library is 
preferred, again, without touching the generic K-Means implementation. 

{code}

  type Distance = Double

  trait FP[T] extends Serializable {
val weight: Distance
val index: Option[T]
val raw : Array[Distance]
  }
  trait PointOps[P <: FP[T], C <: FP[T], T] {
def distance(p: P, c: C, upperBound: Distance): Distance

def userToPoint(v: Array[Double], index: Option[T]): P

def centerToPoint(v: C): P

def pointToCenter(v: P): C

def centroidToCenter(v: Centroid): C

def centroidToPoint(v: Centroid): P

def centerMoved(v: P, w: C): Boolean

  }
{code}


was (Author: derrickburns):
The key abstractions that need to be added to the K-Means implementation to 
support interesting distance functions are: Distance (e.g. Float or Double), T 
(the input data type of a point), P (a point as represented by the distance 
function), C (a cluster center as represented by the distance function), and 
Centroid.  

By separating the user type T from the types P (point) and C (center), one 
can do things like pre-compute values, as is done with the Fast Euclidean 
distance, which pre-computes magnitudes.  (With more complex distance 
functions, such as the Kullback-Leibler function, one can pre-compute the 
logs of the points.)

Further, since the representation of point and center is abstracted, the 
implementer of the trait can use JBlas, Breeze, or whatever math library is 
preferred, again, without touching the generic K-Means implementation. 

{code}

  type Distance = Double

  trait FP[T] extends Serializable {
val weight: Distance
val index: Option[T]
val raw : Array[Distance]
  }
  trait PointOps[P <: FP[T], C <: FP[T], T] {
def distance(p: P, c: C, upperBound: Distance): Distance

def userToPoint(v: Array[Double], index: Option[T]): P

def centerToPoint(v: C): P

def pointToCenter(v: P): C

def centroidToCenter(v: Centroid): C

def centroidToPoint(v: Centroid): P

def centerMoved(v: P, w: C): Boolean

  }
{code}




[jira] [Comment Edited] (SPARK-3219) K-Means clusterer should support Bregman distance functions

2014-09-15 Thread Derrick Burns (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14134339#comment-14134339
 ] 

Derrick Burns edited comment on SPARK-3219 at 9/15/14 7:24 PM:
---

The key abstractions that need to be added to the K-Means implementation to 
support interesting distance functions are: Distance (e.g. Float or Double), T 
(the input data type of a point), P (a point as represented by the distance 
function), C (a cluster center as represented by the distance function), and 
Centroid.  

By separating the user type T from the types P (point) and C (center), one can 
do things like pre-compute values (as is done with the Fast Euclidean distance 
in the 1.0.2 implementation that pre-computes magnitudes).  (With more complex 
distance functions such as the Kullback-Leibler function, one can pre-compute 
the log of the points, which is too expensive to re-compute in the distance 
calculation!)

Further, since the representation of point and center is abstracted, the 
implementer of the trait can use JBlas, Breeze, or whatever math library is 
preferred, again, without touching the generic K-Means implementation. 

{code}

  type Distance = Double

  trait FP[T] extends Serializable {
val weight: Distance
val index: Option[T]
val raw : Array[Distance]
  }
  trait PointOps[P <: FP[T], C <: FP[T], T] {
def distance(p: P, c: C, upperBound: Distance): Distance

def userToPoint(v: Array[Double], index: Option[T]): P

def centerToPoint(v: C): P

def pointToCenter(v: P): C

def centroidToCenter(v: Centroid): C

def centroidToPoint(v: Centroid): P

def centerMoved(v: P, w: C): Boolean

  }
{code}


was (Author: derrickburns):
The key abstractions that need to be added to the K-Means implementation to 
support interesting distance functions are: Distance (e.g. Float or Double), T 
(the input data type of a point), P (a point as represented by the distance 
function), C (a cluster center as represented by the distance function), and 
Centroid.  

By separating the user type T from the types P (point) and C (center), one can 
do things like pre-compute values (as is done with the Fast Euclidean distance 
in the 1.0.2 implementation that pre-computes magnitudes).  (With more complex 
distance functions such as the Kullback-Leibler function, one can pre-compute 
the log of the points.)

Further, since the representation of point and center is abstracted, the 
implementer of the trait can use JBlas, Breeze, or whatever math library is 
preferred, again, without touching the generic K-Means implementation. 

{code}

  type Distance = Double

  trait FP[T] extends Serializable {
val weight: Distance
val index: Option[T]
val raw : Array[Distance]
  }
  trait PointOps[P <: FP[T], C <: FP[T], T] {
def distance(p: P, c: C, upperBound: Distance): Distance

def userToPoint(v: Array[Double], index: Option[T]): P

def centerToPoint(v: C): P

def pointToCenter(v: P): C

def centroidToCenter(v: Centroid): C

def centroidToPoint(v: Centroid): P

def centerMoved(v: P, w: C): Boolean

  }
{code}




[jira] [Comment Edited] (SPARK-3219) K-Means clusterer should support Bregman distance functions

2014-09-15 Thread Derrick Burns (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14134339#comment-14134339
 ] 

Derrick Burns edited comment on SPARK-3219 at 9/15/14 7:26 PM:
---

The key abstractions that need to be added to the K-Means implementation to 
support interesting distance functions are: Distance (the type used to 
represent distance, such as Float or Double), T (the data type used for a point 
by the K-Means client), P (the data type used for a point by the distance 
function), C (the data type used for a cluster center by the distance 
function), and Centroid.  

By separating the user type T from the types P (point) and C (center), one can 
do things like pre-compute values (as is done with the Fast Euclidean distance 
in the 1.0.2 implementation that pre-computes magnitudes).  (With more complex 
distance functions such as the Kullback-Leibler function, one can pre-compute 
the log of the points, which is too expensive to re-compute in the distance 
calculation!)

Further, since the representation of point and center is abstracted, the 
implementer of the trait can use JBlas, Breeze, or whatever math library is 
preferred, again, without touching the generic K-Means implementation. 

{code}

  type Distance = Double

  trait FP[T] extends Serializable {
val weight: Distance
val index: Option[T]
val raw : Array[Distance]
  }
  trait PointOps[P <: FP[T], C <: FP[T], T] {
def distance(p: P, c: C, upperBound: Distance): Distance

def userToPoint(v: Array[Double], index: Option[T]): P

def centerToPoint(v: C): P

def pointToCenter(v: P): C

def centroidToCenter(v: Centroid): C

def centroidToPoint(v: Centroid): P

def centerMoved(v: P, w: C): Boolean

  }
{code}


was (Author: derrickburns):
The key abstractions that need to be added to the K-Means implementation to 
support interesting distance functions are: Distance (e.g. Float or Double), T 
(the input data type of a point), P (a point as represented by the distance 
function), C (a cluster center as represented by the distance function), and 
Centroid.  

By separating the user type T from the types P (point) and C (center), one can 
do things like pre-compute values (as is done with the Fast Euclidean distance 
in the 1.0.2 implementation that pre-computes magnitudes).  (With more complex 
distance functions such as the Kullback-Leibler function, one can pre-compute 
the log of the points, which is too expensive to re-compute in the distance 
calculation!)

Further, since the representation of point and center is abstracted, the 
implementer of the trait can use JBlas, Breeze, or whatever math library is 
preferred, again, without touching the generic K-Means implementation. 

{code}

  type Distance = Double

  trait FP[T] extends Serializable {
val weight: Distance
val index: Option[T]
val raw : Array[Distance]
  }
  trait PointOps[P <: FP[T], C <: FP[T], T] {
def distance(p: P, c: C, upperBound: Distance): Distance

def userToPoint(v: Array[Double], index: Option[T]): P

def centerToPoint(v: C): P

def pointToCenter(v: P): C

def centroidToCenter(v: Centroid): C

def centroidToPoint(v: Centroid): P

def centerMoved(v: P, w: C): Boolean

  }
{code}




[jira] [Comment Edited] (SPARK-3219) K-Means clusterer should support Bregman distance functions

2014-09-15 Thread Derrick Burns (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14134339#comment-14134339
 ] 

Derrick Burns edited comment on SPARK-3219 at 9/15/14 7:26 PM:
---

The key abstractions that need to be added to the K-Means implementation to 
support interesting distance functions are: Distance (the type used to 
represent distance, such as Float or Double), T (the data type used for a point 
by the K-Means client), P (the data type used for a point by the distance 
function), C (the data type used for a cluster center by the distance 
function), and Centroid.  

By separating the user type T from the types P (point) and C (center), one can 
do things like pre-compute values (as is done with the Fast Euclidean distance 
in the 1.0.2 implementation that pre-computes magnitudes).  (With more complex 
distance functions such as the Kullback-Leibler function, one can pre-compute 
the log of the points, which is too expensive to re-compute in the distance 
calculation!)

Further, since the representation of point and center is abstracted, the 
implementer of the trait can use JBlas, Breeze, or whatever math library is 
preferred, again, without touching the generic K-Means implementation. 

{code}

  type Distance = Double

  trait FP[T] extends Serializable {
val weight: Distance
val index: Option[T]
val raw : Array[Distance]
  }

  trait PointOps[P <: FP[T], C <: FP[T], T] {
def distance(p: P, c: C, upperBound: Distance): Distance

def userToPoint(v: Array[Double], index: Option[T]): P

def centerToPoint(v: C): P

def pointToCenter(v: P): C

def centroidToCenter(v: Centroid): C

def centroidToPoint(v: Centroid): P

def centerMoved(v: P, w: C): Boolean

  }
{code}
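
To illustrate why the generic clusterer never has to look inside the distance 
function, here is a minimal sketch of the assignment step of Lloyd's 
algorithm written against a stripped-down Ops trait; Ops is a stand-in for 
the PointOps trait above, and all names are assumptions for this 
illustration:

{code}
// Sketch only: the generic code sees P, C, and a distance; it never needs to
// know how they are represented.
object GenericAssignSketch {
  trait Ops[P, C] {
    def distance(p: P, c: C): Double
  }

  // Index of the closest center under whatever distance Ops supplies.
  def closest[P, C](ops: Ops[P, C], centers: IndexedSeq[C], p: P): Int = {
    var best = 0
    var bestDist = ops.distance(p, centers(0))
    var i = 1
    while (i < centers.length) {
      val d = ops.distance(p, centers(i))
      if (d < bestDist) { best = i; bestDist = d }
      i += 1
    }
    best
  }

  def main(args: Array[String]): Unit = {
    val squaredEuclidean = new Ops[Array[Double], Array[Double]] {
      def distance(p: Array[Double], c: Array[Double]): Double =
        p.zip(c).map { case (a, b) => (a - b) * (a - b) }.sum
    }
    val centers = IndexedSeq(Array(0.0, 0.0), Array(5.0, 5.0))
    println(closest(squaredEuclidean, centers, Array(4.0, 4.5)))  // 1
  }
}
{code}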


was (Author: derrickburns):
The key abstractions that need to be added to the K-Means implementation to 
support interesting distance functions are: Distance (the type used to 
represent distance, such as Float or Double), T (the data type used for a point 
by the K-Means client), P (the data type used for a point by the distance 
function), C (the data type used for a cluster center by the distance 
function), and Centroid.  

By separating the user type T from the types P (point) and C (center), one can 
do things like pre-compute values (as is done with the Fast Euclidean distance 
in the 1.0.2 implementation that pre-computes magnitudes).  (With more complex 
distance functions such as the Kullback-Leibler function, one can pre-compute 
the log of the points, which is too expensive to re-compute in the distance 
calculation!)

Further, since the representation of point and center is abstracted, the 
implementer of the trait can use JBlas, Breeze, or whatever math library is 
preferred, again, without touching the generic K-Means implementation. 

{code}

  type Distance = Double

  trait FP[T] extends Serializable {
val weight: Distance
val index: Option[T]
val raw : Array[Distance]
  }
  trait PointOps[P <: FP[T], C <: FP[T], T] {
def distance(p: P, c: C, upperBound: Distance): Distance

def userToPoint(v: Array[Double], index: Option[T]): P

def centerToPoint(v: C): P

def pointToCenter(v: P): C

def centroidToCenter(v: Centroid): C

def centroidToPoint(v: Centroid): P

def centerMoved(v: P, w: C): Boolean

  }
{code}




[jira] [Comment Edited] (SPARK-3219) K-Means clusterer should support Bregman distance functions

2014-09-15 Thread Derrick Burns (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14120325#comment-14120325
 ] 

Derrick Burns edited comment on SPARK-3219 at 9/15/14 7:26 PM:
---

Great!

You can find my work here:
https://github.com/derrickburns/generalized-kmeans-clustering.git.

I should warn you that I rewrote much of the original Spark clusterer
because the original is too tightly coupled to the Euclidean norm and does
not allow one to identify efficiently which points belong to which clusters.
I have tested this version extensively.

You will notice a package called com.rincaro.clusterer.metrics.  Please take
a look at the two files EuOps.scala and FastEuclideanOps.scala.  They both
implement the Euclidean norm; however, one is much faster than the other
because it uses the same algebraic transformations that the Spark version
uses (see the sketch below).  This demonstrates that it is possible to be
efficient without being tightly coupled.  One could easily re-implement
FastEuclideanOps using Breeze or BLAS without affecting the core K-Means
implementation.
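
A small sketch of the algebraic transformation referred to above, assuming 
plain arrays and illustrative names: expanding ||p - c||^2 as 
||p||^2 + ||c||^2 - 2 (p . c) lets the squared norms be cached per point and 
per center, so only the dot product remains per distance evaluation, and both 
forms return the same value.

{code}
// Sketch only: the naive squared Euclidean distance and its expanded form agree.
object EuclideanIdentitySketch {
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  def naive(p: Array[Double], c: Array[Double]): Double =
    p.zip(c).map { case (x, y) => (x - y) * (x - y) }.sum

  def expanded(p: Array[Double], c: Array[Double]): Double =
    dot(p, p) + dot(c, c) - 2.0 * dot(p, c)  // dot(p, p) and dot(c, c) can be cached

  def main(args: Array[String]): Unit = {
    val (p, c) = (Array(1.0, 2.0, 3.0), Array(0.0, 2.0, 5.0))
    println(naive(p, c))     // 5.0
    println(expanded(p, c))  // 5.0
  }
}
{code}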

Not included in this project is another distance function that I have
implemented: the Kullback-Leibler distance function, a.k.a. relative
entropy.  In my implementation, I also perform algebraic transformations to
expedite the computation, resulting in a distance computation that is even
faster than the fast Euclidean norm.

Let me know if this is useful to you.






was (Author: derrickburns):
Great!

You can find my work here:
https://github.com/derrickburns/generalized-kmeans-clustering.git.

I should warn you that I rewrote much of the original Spark clusterer
because the original is too tightly coupled to the Euclidean norm and does
not allow one to identify efficiently which points belong to which clusters.
I have tested this version extensively.

You will notice a package called com.rincaro.clusterer.metrics.  Please take
a look at the two files EuOps.scala and FastEuclideanOps.scala.  They both
implement the Euclidean norm; however, one is much faster than the other
because it uses the same algebraic transformations that the Spark version
uses.  This demonstrates that it is possible to be efficient without being
tightly coupled.  One could easily re-implement FastEuclideanOps using
Breeze or BLAS without affecting the core K-Means implementation.

Not included in this project is another distance function that I have
implemented: the Kullback-Leibler distance function, a.k.a. relative
entropy.  In my implementation, I also perform algebraic transformations to
expedite the computation, resulting in a distance computation that is even
faster than the fast Euclidean norm.

Let me know if this is useful to you.




