[ https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14134339#comment-14134339 ]
Derrick Burns edited comment on SPARK-3219 at 9/15/14 7:24 PM: --------------------------------------------------------------- The key abstractions that need to be added to the K-Means implementation to support interesting distance functions are: Distance (e.g. Float or Double), T (the input data type of a point), P (a point as represented by the distance function), C (a cluster center as represented by the distance function), and Centroid. By separating the user type T from the types P (point) and C (center), one can do things like pre-compute values (as is done with the Fast Euclidean distance in the 1.0.2 implementation that pre-computes magnitudes). (With more complex distance functions such as the Kullback-Leibler function, one can pre-compute the log of the points, which is too expensive to re-compute in the distance calculation!) Further, since the representation of point and center is abstracted, the implementer of the trait can use JBlas, Breeze, or whatever math library is preferred, again, without touching the generic K-Means implementation. {code} type Distance = Double trait FP[T] extends Serializable { val weight: Distance val index: Option[T] val raw : Array[Distance] } trait PointOps[P <: FP[T], C <: FP[T], T] { def distance(p: P, c: C, upperBound: Distance): Distance def userToPoint(v: Array[Double], index: Option[T]): P def centerToPoint(v: C): P def pointToCenter(v: P): C def centroidToCenter(v: Centroid): C def centroidToPoint(v: Centroid): P def centerMoved(v: P, w: C): Boolean } {code} was (Author: derrickburns): The key abstractions that need to be added to the K-Means implementation to support interesting distance functions are: Distance (e.g. Float or Double), T (the input data type of a point), P (a point as represented by the distance function), C (a cluster center as represented by the distance function), and Centroid. By separating the user type T from the types P (point) and C (center), one can do things like pre-compute values (as is done with the Fast Euclidean distance in the 1.0.2 implementation that pre-computes magnitudes). (With more complex distance functions such as the Kullback-Leibler function, one can pre-compute the log of the points.) Further, since the representation of point and center is abstracted, the implementer of the trait can use JBlas, Breeze, or whatever math library is preferred, again, without touching the generic K-Means implementation. {code} type Distance = Double trait FP[T] extends Serializable { val weight: Distance val index: Option[T] val raw : Array[Distance] } trait PointOps[P <: FP[T], C <: FP[T], T] { def distance(p: P, c: C, upperBound: Distance): Distance def userToPoint(v: Array[Double], index: Option[T]): P def centerToPoint(v: C): P def pointToCenter(v: P): C def centroidToCenter(v: Centroid): C def centroidToPoint(v: Centroid): P def centerMoved(v: P, w: C): Boolean } {code} > K-Means clusterer should support Bregman distance functions > ----------------------------------------------------------- > > Key: SPARK-3219 > URL: https://issues.apache.org/jira/browse/SPARK-3219 > Project: Spark > Issue Type: Improvement > Components: MLlib > Reporter: Derrick Burns > Assignee: Derrick Burns > > The K-Means clusterer supports the Euclidean distance metric. However, it is > rather straightforward to support Bregman > (http://machinelearning.wustl.edu/mlpapers/paper_files/BanerjeeMDG05.pdf) > distance functions which would increase the utility of the clusterer > tremendously. > I have modified the clusterer to support pluggable distance functions. > However, I notice that there are hundreds of outstanding pull requests. If > someone is willing to work with me to sponsor the work through the process, I > will create a pull request. Otherwise, I will just keep my own fork. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org