[ https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14120325#comment-14120325 ]
Derrick Burns commented on SPARK-3219:
--------------------------------------

Great! You can find my work here:
https://github.com/derrickburns/generalized-kmeans-clustering.git

I should warn you that I rewrote much of the original Spark clusterer because the original is too tightly coupled to the Euclidean norm and does not allow one to efficiently identify which points belong to which clusters. I have tested this version extensively.

You will notice a package called com.rincaro.clusterer.metrics. Please take a look at the two files EuOps.scala and FastEuclideansOps.scala. Both implement the Euclidean norm. However, one is much faster than the other because it uses the same algebraic transformations that the Spark version uses. This demonstrates that it is possible to be efficient without being tightly coupled. One could easily re-implement FastEuclideanOps using Breeze or BLAS without affecting the core K-Means implementation.

Not included in this project is another distance function that I have implemented: the Kullback-Leibler distance function, a.k.a. relative entropy. In my implementation, I also perform algebraic transformations to expedite the computation, resulting in a distance computation that is even faster than the fast Euclidean norm.

Let me know if this is useful to you.

> K-Means clusterer should support Bregman distance functions
> -----------------------------------------------------------
>
>                 Key: SPARK-3219
>                 URL: https://issues.apache.org/jira/browse/SPARK-3219
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Derrick Burns
>            Assignee: Derrick Burns
>
> The K-Means clusterer supports the Euclidean distance metric. However, it is
> rather straightforward to support Bregman
> (http://machinelearning.wustl.edu/mlpapers/paper_files/BanerjeeMDG05.pdf)
> distance functions which would increase the utility of the clusterer
> tremendously.
> I have modified the clusterer to support pluggable distance functions.
> However, I notice that there are hundreds of outstanding pull requests. If
> someone is willing to work with me to sponsor the work through the process, I
> will create a pull request. Otherwise, I will just keep my own fork.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
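
For readers following the thread, here is a minimal sketch of the kind of algebraic transformation the comment refers to, assuming it is the usual expansion ||p - c||^2 = ||p||^2 - 2 p.c + ||c||^2 with the squared norms computed once per point. All names below (DistanceOps, NormedPoint, DirectEuclidean, ExpandedEuclidean, Assignment) are hypothetical and are not taken from the linked repository or from Spark. The sketch works with squared Euclidean distance, which selects the same nearest center as the unsquared distance.

```scala
// Illustrative only: these names are assumptions, not the repository's classes.

/** A point bundled with a precomputed squared norm. */
final case class NormedPoint(coords: Array[Double], sqNorm: Double)

/** Pluggable distance: the clustering loop depends only on this interface. */
trait DistanceOps {
  def prepare(coords: Array[Double]): NormedPoint
  def distance(p: NormedPoint, c: NormedPoint): Double
}

object DistanceOps {
  /** Plain dot product; a BLAS or Breeze call could be substituted here. */
  def dot(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "dimension mismatch")
    var sum = 0.0
    var i = 0
    while (i < a.length) { sum += a(i) * b(i); i += 1 }
    sum
  }
}

/** Direct definition: sum of squared coordinate differences. */
object DirectEuclidean extends DistanceOps {
  // The squared norm is not needed by this variant, so it is left at 0.
  def prepare(coords: Array[Double]): NormedPoint = NormedPoint(coords, 0.0)

  def distance(p: NormedPoint, c: NormedPoint): Double = {
    require(p.coords.length == c.coords.length, "dimension mismatch")
    var sum = 0.0
    var i = 0
    while (i < p.coords.length) {
      val d = p.coords(i) - c.coords(i)
      sum += d * d
      i += 1
    }
    sum
  }
}

/** Expanded form: squared norms are computed once in prepare, so each
  * distance evaluation needs only one dot product. */
object ExpandedEuclidean extends DistanceOps {
  def prepare(coords: Array[Double]): NormedPoint =
    NormedPoint(coords, DistanceOps.dot(coords, coords))

  def distance(p: NormedPoint, c: NormedPoint): Double =
    // Clamp at zero because rounding can produce a tiny negative value.
    math.max(0.0, p.sqNorm - 2.0 * DistanceOps.dot(p.coords, c.coords) + c.sqNorm)
}

object Assignment {
  /** Index of the nearest center under whichever DistanceOps is supplied. */
  def closest(ops: DistanceOps, point: NormedPoint, centers: IndexedSeq[NormedPoint]): Int =
    centers.indices.minBy(i => ops.distance(point, centers(i)))
}
```

Because Assignment.closest takes the DistanceOps as a parameter, swapping DirectEuclidean for ExpandedEuclidean (or for some other divergence) would not require touching the assignment step, which is the decoupling the comment argues for.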