[ 
https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14120325#comment-14120325
 ] 

Derrick Burns commented on SPARK-3219:
--------------------------------------

Great!

You can find my work here:
https://github.com/derrickburns/generalized-kmeans-clustering.git

I should warn you that I rewrote much of the original Spark clusterer
because the original is too tightly coupled to the Euclidean norm and does
not allow one to identify efficiently which points belong to which
clusters.  I have tested this version extensively.

You will notice a package called com.rincaro.clusterer.metrics.  Please take
a look at the two files EuOps.scala and FastEuclideanOps.scala.  They
both implement the Euclidean norm.  However, one is much faster than the
other because it uses the same algebraic transformations that the Spark
version uses.  This demonstrates that it is possible to be efficient
without being tightly coupled.  One could easily re-implement
FastEuclideanOps using Breeze or BLAS without affecting the core K-Means
implementation.
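
To make the decoupling concrete, here is a minimal sketch of the idea.  It
is not the code from the repository: the names DistanceSketch, PointOps,
DirectEuclidean and ExpandedEuclidean are hypothetical, and I am assuming
the algebraic transformation in question is the usual expansion
||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y with the squared norms cached.

```scala
// Hypothetical sketch; none of these names come from the repository.
object DistanceSketch {
  // A point together with its cached squared norm.
  final case class NormedPoint(values: Array[Double], squaredNorm: Double)

  // The only interface the clustering loop needs to know about.
  trait PointOps {
    def prepare(values: Array[Double]): NormedPoint
    def distance(p: NormedPoint, c: NormedPoint): Double
  }

  private def dot(a: Array[Double], b: Array[Double]): Double = {
    var sum = 0.0
    var i = 0
    while (i < a.length) { sum += a(i) * b(i); i += 1 }
    sum
  }

  // Direct evaluation: sum of (x_i - y_i)^2; the cached norm is not used.
  object DirectEuclidean extends PointOps {
    def prepare(values: Array[Double]): NormedPoint =
      NormedPoint(values, dot(values, values))
    def distance(p: NormedPoint, c: NormedPoint): Double = {
      var sum = 0.0
      var i = 0
      while (i < p.values.length) {
        val d = p.values(i) - c.values(i)
        sum += d * d
        i += 1
      }
      sum
    }
  }

  // Expanded form: ||x||^2 + ||y||^2 - 2 x.y, using the cached norms.
  // The per-pair work is a single dot product, which is the part one
  // could hand to Breeze or BLAS.
  object ExpandedEuclidean extends PointOps {
    def prepare(values: Array[Double]): NormedPoint =
      NormedPoint(values, dot(values, values))
    def distance(p: NormedPoint, c: NormedPoint): Double =
      // Clamp at zero to guard against small negative rounding errors.
      math.max(0.0, p.squaredNorm + c.squaredNorm - 2.0 * dot(p.values, c.values))
  }

  // Distance-agnostic assignment step: index of the closest center.
  def closest(ops: PointOps, p: NormedPoint, centers: IndexedSeq[NormedPoint]): Int =
    centers.indices.minBy(i => ops.distance(p, centers(i)))
}
```

The assignment step only sees PointOps, so swapping one implementation for
the other (or for a non-Euclidean one) does not touch it.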

Not included in this project is another distance function that I have
implemented: the Kullback-Leibler distance function, a.k.a. relative
entropy.  In my implementation, I also perform algebraic transformations to
expedite the computation, resulting in a distance computation that is even
faster than the fast Euclidean norm.
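
Since that implementation is not in the project, the following is only a
sketch of one transformation of this kind, not the actual code.  It assumes
both arguments are probability vectors and that the center has strictly
positive entries, and it uses the identity
KL(p || q) = sum_i p_i ln p_i - sum_i p_i ln q_i.  The first term depends
only on the point and the logarithms depend only on the center, so both can
be cached.  The names KlSketch, KlPoint and KlCenter are hypothetical.

```scala
// Hypothetical sketch; the real implementation is not in the repository.
object KlSketch {
  // A point with its cached sum_i p_i * ln(p_i).
  final case class KlPoint(p: Array[Double], sumPLogP: Double)

  // A center stored as its element-wise natural logarithm.
  final case class KlCenter(logQ: Array[Double])

  def preparePoint(p: Array[Double]): KlPoint = {
    var s = 0.0
    // By convention 0 * ln(0) contributes 0, so zero entries are skipped.
    for (x <- p if x > 0.0) s += x * math.log(x)
    KlPoint(p, s)
  }

  // Assumes every entry of q is strictly positive.
  def prepareCenter(q: Array[Double]): KlCenter = KlCenter(q.map(math.log))

  // KL(p || q) = sumPLogP - sum_i p_i * logQ_i.
  // No logarithm is evaluated per (point, center) pair.
  def divergence(x: KlPoint, c: KlCenter): Double = {
    var cross = 0.0
    var i = 0
    while (i < x.p.length) {
      if (x.p(i) > 0.0) cross += x.p(i) * c.logQ(i)
      i += 1
    }
    x.sumPLogP - cross
  }
}
```

With the caching in place, each point-to-center evaluation is one dot
product and one subtraction.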

Let me know if this is useful to you.





> K-Means clusterer should support Bregman distance functions
> -----------------------------------------------------------
>
>                 Key: SPARK-3219
>                 URL: https://issues.apache.org/jira/browse/SPARK-3219
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Derrick Burns
>            Assignee: Derrick Burns
>
> The K-Means clusterer supports the Euclidean distance metric.  However, it is 
> rather straightforward to support Bregman 
> (http://machinelearning.wustl.edu/mlpapers/paper_files/BanerjeeMDG05.pdf) 
> distance functions which would increase the utility of the clusterer 
> tremendously.
> I have modified the clusterer to support pluggable distance functions.  
> However, I notice that there are hundreds of outstanding pull requests.  If 
> someone is willing to work with me to sponsor the work through the process, I 
> will create a pull request.  Otherwise, I will just keep my own fork.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
