[ https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14134339#comment-14134339 ]
Derrick Burns edited comment on SPARK-3219 at 9/15/14 7:14 PM: --------------------------------------------------------------- The key abstractions that need to be added to the K-Means implementation to support interesting distance functions are: Point (P), Center (C), and Centroid. Then, one can implementation a distance function Trait (called PointOps below) in a way that allows the implementer to pre-compute values for Point and Center, such as is hard-coded for the fast squared Euclidean distance function in the 1.0.2 K-Means implementation. Since the representation of Point and Center is abstracted, the implementer of the trait can use JBlas, Breeze, or whatever math library is preferred, again, without touching the generic K-Means implementation. {code} trait PointOps[P <: FP[T], C <: FP[T], T] { def distance(p: P, c: C, upperBound: Distance): Distance def userToPoint(v: Array[Double], index: Option[T]): P def centerToPoint(v: C): P def pointToCenter(v: P): C def centroidToCenter(v: Centroid): C def centroidToPoint(v: Centroid): P def centerMoved(v: P, w: C): Boolean } {code} was (Author: derrickburns): The key abstractions that need to be added to the K-Means implementation to support interesting distance functions are: Point (P), Center (C), and Centroid. Then, one can implementation a distance function Trait (called PointOps below) in a way that allows the implementer to pre-compute values for Point and Center, such as is hard-coded for the fast squared Euclidean distance function in the 1.0.2 K-Means implementation. Since the representation of Point and Center is abstracted, the implementer of the trait can use JBlas, Breeze, or whatever math library is preferred, again, without touching the generic K-Means implementation. trait PointOps[P <: FP[T], C <: FP[T], T] { def distance(p: P, c: C, upperBound: Distance): Distance def userToPoint(v: Array[Double], index: Option[T]): P def centerToPoint(v: C): P def pointToCenter(v: P): C def centroidToCenter(v: Centroid): C def centroidToPoint(v: Centroid): P def centerMoved(v: P, w: C): Boolean } > K-Means clusterer should support Bregman distance functions > ----------------------------------------------------------- > > Key: SPARK-3219 > URL: https://issues.apache.org/jira/browse/SPARK-3219 > Project: Spark > Issue Type: Improvement > Components: MLlib > Reporter: Derrick Burns > Assignee: Derrick Burns > > The K-Means clusterer supports the Euclidean distance metric. However, it is > rather straightforward to support Bregman > (http://machinelearning.wustl.edu/mlpapers/paper_files/BanerjeeMDG05.pdf) > distance functions which would increase the utility of the clusterer > tremendously. > I have modified the clusterer to support pluggable distance functions. > However, I notice that there are hundreds of outstanding pull requests. If > someone is willing to work with me to sponsor the work through the process, I > will create a pull request. Otherwise, I will just keep my own fork. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org