[ 
https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14134339#comment-14134339
 ] 

Derrick Burns edited comment on SPARK-3219 at 9/15/14 7:26 PM:
---------------------------------------------------------------

The key abstractions that need to be added to the K-Means implementation to 
support interesting distance functions are: Distance (the type used to 
represent distance, such as Float or Double), T (the data type used for a point 
by the K-Means client), P (the data type used for a point by the distance 
function), C (the data type used for a cluster center by the distance 
function), and Centroid.  

By separating the user type T from the types P (point) and C (center), one can 
do things like pre-compute values (as is done with the Fast Euclidean distance 
in the 1.0.2 implementation that pre-computes magnitudes).  (With more complex 
distance functions such as the Kullback-Leibler function, one can pre-compute 
the log of the points, which is too expensive to re-compute in the distance 
calculation!)

Further, since the representation of point and center is abstracted, the 
implementer of the trait can use JBlas, Breeze, or whatever math library is 
preferred, again, without touching the generic K-Means implementation. 

{code}

  type Distance = Double

  trait FP[T] extends Serializable {
    val weight: Distance
    val index: Option[T]
    val raw : Array[Distance]
  }

  trait PointOps[P <: FP[T], C <: FP[T], T] {
    def distance(p: P, c: C, upperBound: Distance): Distance

    def userToPoint(v: Array[Double], index: Option[T]): P

    def centerToPoint(v: C): P

    def pointToCenter(v: P): C

    def centroidToCenter(v: Centroid): C

    def centroidToPoint(v: Centroid): P

    def centerMoved(v: P, w: C): Boolean

  }
{code}


was (Author: derrickburns):
The key abstractions that need to be added to the K-Means implementation to 
support interesting distance functions are: Distance (the type used to 
represent distance, such as Float or Double), T (the data type used for a point 
by the K-Means client), P (the data type used for a point by the distance 
function), C (the data type used for a cluster center by the distance 
function), and Centroid.  

By separating the user type T from the types P (point) and C (center), one can 
do things like pre-compute values (as is done with the Fast Euclidean distance 
in the 1.0.2 implementation that pre-computes magnitudes).  (With more complex 
distance functions such as the Kullback-Leibler function, one can pre-compute 
the log of the points, which is too expensive to re-compute in the distance 
calculation!)

Further, since the representation of point and center is abstracted, the 
implementer of the trait can use JBlas, Breeze, or whatever math library is 
preferred, again, without touching the generic K-Means implementation. 

{code}

  type Distance = Double

  trait FP[T] extends Serializable {
    val weight: Distance
    val index: Option[T]
    val raw : Array[Distance]
  }
  trait PointOps[P <: FP[T], C <: FP[T], T] {
    def distance(p: P, c: C, upperBound: Distance): Distance

    def userToPoint(v: Array[Double], index: Option[T]): P

    def centerToPoint(v: C): P

    def pointToCenter(v: P): C

    def centroidToCenter(v: Centroid): C

    def centroidToPoint(v: Centroid): P

    def centerMoved(v: P, w: C): Boolean

  }
{code}

> K-Means clusterer should support Bregman distance functions
> -----------------------------------------------------------
>
>                 Key: SPARK-3219
>                 URL: https://issues.apache.org/jira/browse/SPARK-3219
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Derrick Burns
>            Assignee: Derrick Burns
>
> The K-Means clusterer supports the Euclidean distance metric.  However, it is 
> rather straightforward to support Bregman 
> (http://machinelearning.wustl.edu/mlpapers/paper_files/BanerjeeMDG05.pdf) 
> distance functions which would increase the utility of the clusterer 
> tremendously.
> I have modified the clusterer to support pluggable distance functions.  
> However, I notice that there are hundreds of outstanding pull requests.  If 
> someone is willing to work with me to sponsor the work through the process, I 
> will create a pull request.  Otherwise, I will just keep my own fork.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to