[ https://issues.apache.org/jira/browse/SPARK-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14134291#comment-14134291 ]
Derrick Burns commented on SPARK-2308:
--------------------------------------

I have submitted several issues regarding the Spark K-Means clustering implementation, and I would like to contribute my new K-Means implementation to Spark. However, my implementation is a re-write, not simply a trivial pull request.

Why did I re-write it? The initial reason is that the 1.0.2 (and 1.1.0) implementation does not support the distance function that I wanted: I use the KL-divergence measure, whereas yours uses the squared Euclidean distance. Unfortunately, the Euclidean metric is so deeply ingrained in the 1.0.2 implementation that a major re-write was required. My implementation has a pluggable distance function. I wrote an implementation of the squared Euclidean distance function (two versions, actually) and an implementation of KL-divergence. With this abstraction, the core K-Means algorithm knows nothing about the distance function.

The second reason is that the 1.0.2 implementation is rather slow, because it recomputes the distances from all points to all clusters on every round. Further, in my experiments on millions of points and thousands of clusters with hundreds of dimensions, the K-Means algorithm is CPU-bound, with very little memory used. I observed that the vast majority of distance calculations can be eliminated by maintaining, for each point, its closest cluster, the distance to that cluster, and the round on which either of those values last changed, and, for each cluster, the round on which that cluster's center last moved. This requires a modest amount of additional space, but results in a dramatic improvement in running time.

I have tested my new implementation extensively on EC2 using up to 16 c3.8xlarge worker machines. The application is in financial services, so I need the results to be right. :)

I call my implementation "Tracking K-Means". If you would entertain including my implementation in a future Spark release, please let me know.
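The pluggable-distance abstraction described above can be sketched roughly as follows. This is an illustrative, hypothetical sketch, not the actual Tracking K-Means or MLlib code; all function names here are invented for the example:

```python
import math

def squared_euclidean(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kl_divergence(p, q):
    """KL divergence D(p || q); assumes both are probability
    distributions with strictly positive entries."""
    return sum(a * math.log(a / b) for a, b in zip(p, q))

def closest_center(point, centers, distance):
    """Core assignment step. The algorithm only calls `distance`;
    it knows nothing about which measure is plugged in."""
    best_idx, best_dist = 0, distance(point, centers[0])
    for i in range(1, len(centers)):
        d = distance(point, centers[i])
        if d < best_dist:
            best_idx, best_dist = i, d
    return best_idx, best_dist
```

With this shape, swapping squared Euclidean for KL-divergence is just a matter of passing a different function to the assignment step.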
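The tracking bookkeeping described above (closest cluster, distance, and last-changed rounds) can be sketched in a simplified, single-machine form. This is my own illustrative reconstruction of the idea, not the author's code; `tracking_assign` and its state layout are hypothetical:

```python
def tracking_assign(points, centers, state, moved_round, current_round, distance):
    """One assignment pass of a 'tracking' round (simplified sketch).

    state[i] = (closest_idx, dist, round_updated) for point i.
    moved_round[j] = last round on which center j moved.
    If neither a candidate center nor the point's current best center
    has moved since the point was last evaluated, the stored comparison
    still holds and the distance computation is skipped entirely.
    """
    for i, p in enumerate(points):
        best, best_d, updated = state[i]
        best_moved = moved_round[best] > updated
        if best_moved:
            # our own center moved; its stored distance is stale
            best_d = distance(p, centers[best])
        for j, c in enumerate(centers):
            if j == best:
                continue
            if not best_moved and moved_round[j] <= updated:
                continue  # nothing relevant moved: skip this distance
            d = distance(p, c)
            if d < best_d:
                best, best_d = j, d
        state[i] = (best, best_d, current_round)
    return state
```

When most centers stop moving in later rounds, most of the inner-loop distance calls are skipped, which matches the CPU-bound profile described in the comment.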
> Add KMeans MiniBatch clustering algorithm to MLlib
> --------------------------------------------------
>
> Key: SPARK-2308
> URL: https://issues.apache.org/jira/browse/SPARK-2308
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Reporter: RJ Nowling
> Assignee: RJ Nowling
> Priority: Minor
> Attachments: many_small_centers.pdf, uneven_centers.pdf
>
> Mini-batch is a version of KMeans that uses a randomly-sampled subset of the
> data points in each iteration instead of the full set of data points,
> improving performance (and in some cases, accuracy). The mini-batch version
> is compatible with the KMeans|| initialization algorithm currently
> implemented in MLlib.
> I suggest adding KMeans Mini-batch as an alternative.
> I'd like this to be assigned to me.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
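The mini-batch scheme described in the quoted issue can be sketched as follows. This is an illustrative sketch of one mini-batch update in the style of Sculley's web-scale K-Means (sample a batch, assign, then move each center toward its batch points with a per-center learning rate), not MLlib code; the function name and signature are hypothetical:

```python
import random

def minibatch_kmeans_step(points, centers, counts, batch_size, distance, rng=random):
    """One mini-batch K-Means update (illustrative sketch).

    counts[j] tracks how many batch points center j has absorbed so far;
    the per-center learning rate 1/counts[j] shrinks as j stabilizes."""
    batch = rng.sample(points, batch_size)
    # cache assignments before moving any centers
    assignments = [min(range(len(centers)),
                       key=lambda j: distance(p, centers[j]))
                   for p in batch]
    for p, j in zip(batch, assignments):
        counts[j] += 1
        eta = 1.0 / counts[j]
        centers[j] = [(1 - eta) * c + eta * x
                      for c, x in zip(centers[j], p)]
    return centers, counts
```

Because each iteration touches only `batch_size` points rather than the full data set, the per-round cost is independent of the number of points, which is the performance benefit the issue describes.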