[ https://issues.apache.org/jira/browse/SPARK-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14134291#comment-14134291 ]

Derrick Burns commented on SPARK-2308:
--------------------------------------

I have submitted several issues regarding the Spark K-Means clustering
implementation, and I would like to contribute my new K-Means
implementation to Spark. However, my implementation is a rewrite, not a
trivial pull request.

Why did I re-write it?

The initial reason for rewriting the code is that the 1.0.2 (and 1.1.0)
implementation does not support the distance function that I wanted.  I use
the KL-divergence measure, whereas yours uses the squared Euclidean
distance.  Unfortunately, the Euclidean metric is so deeply ingrained in
the 1.0.2 implementation that a major rewrite was required.

My implementation has a pluggable distance function.  I wrote an
implementation of the squared Euclidean distance function (actually I wrote
two versions), and an implementation of KL-divergence. With this
abstraction, the core K-Means algorithm knows nothing about the distance
function.
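To sketch what the pluggable abstraction looks like (in Python for
brevity; my implementation is in Scala, and the names here are
illustrative, not the real API):

```python
import numpy as np

def squared_euclidean(p, c):
    """Squared Euclidean distance between point p and center c."""
    diff = p - c
    return float(np.dot(diff, diff))

def kl_divergence(p, c, eps=1e-10):
    """Kullback-Leibler divergence KL(p || c) for probability vectors.
    Clipping by eps guards against log(0) and division by zero."""
    p = np.clip(p, eps, None)
    c = np.clip(c, eps, None)
    return float(np.sum(p * np.log(p / c)))

def closest_center(point, centers, distance):
    """Core assignment step: knows nothing about the distance function,
    which is passed in as an argument."""
    dists = [distance(point, c) for c in centers]
    best = int(np.argmin(dists))
    return best, dists[best]
```

The core algorithm only ever calls `distance`, so swapping KL-divergence
for squared Euclidean is a one-argument change.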

The second reason is that I noticed that the 1.0.2 implementation is
rather slow because it recomputes all distances to all points on every
round.  Further, in my experiments on millions of points and thousands of
clusters with hundreds of dimensions, the K-Means algorithm is CPU-bound,
with very little memory being used.

I observed that we can eliminate the vast majority of distance calculations
by maintaining, for each point, its closest cluster, the distance to that
cluster, and the round in which either of those values last changed, and,
for each cluster, the round in which that cluster last moved. This requires
a modest amount of additional space but results in a dramatic improvement
in running time.
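A simplified sketch of the tracking idea (Python, with illustrative
names; when a point's own center has moved I fall back to a full scan
here, which is more conservative than a real implementation needs to be):

```python
import numpy as np

def tracking_assign(points, centers, moved_round, assign, dist_to,
                    changed_round, rnd, distance):
    """One assignment round; returns the number of distances computed.

    Per point we keep: assign[i] (closest center), dist_to[i] (cached
    distance), changed_round[i] (round either value last changed).
    Per center: moved_round[j] (round the center last moved)."""
    computed = 0
    for i, p in enumerate(points):
        if moved_round[assign[i]] >= changed_round[i]:
            # Our own center moved, so the cached distance is stale:
            # conservatively rescan all centers for this point.
            cand = range(len(centers))
            dist_to[i] = float("inf")
        else:
            # A center that has not moved since our record was updated
            # was farther than the cached best then, and still is now.
            cand = [j for j in range(len(centers))
                    if moved_round[j] >= changed_round[i]]
        for j in cand:
            d = distance(p, centers[j])
            computed += 1
            if d < dist_to[i]:
                assign[i], dist_to[i], changed_round[i] = j, d, rnd
    return computed
```

Once the centers stop moving, the candidate list is empty for nearly
every point, so late rounds do almost no distance computations at all.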

I have tested my new implementation extensively on EC2 using up to 16
c3.8xlarge worker machines.  The application is in financial services, so I
need the results to be right. :)

I call my implementation "Tracking K-Means".  If you would entertain
including my implementation in a future Spark release, please let me know.




> Add KMeans MiniBatch clustering algorithm to MLlib
> --------------------------------------------------
>
>                 Key: SPARK-2308
>                 URL: https://issues.apache.org/jira/browse/SPARK-2308
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: RJ Nowling
>            Assignee: RJ Nowling
>            Priority: Minor
>         Attachments: many_small_centers.pdf, uneven_centers.pdf
>
>
> Mini-batch is a version of KMeans that uses a randomly-sampled subset of the 
> data points in each iteration instead of the full set of data points, 
> improving performance (and in some cases, accuracy).  The mini-batch version 
> is compatible with the KMeans|| initialization algorithm currently 
> implemented in MLlib.
> I suggest adding KMeans Mini-batch as an alternative.
> I'd like this to be assigned to me.
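For reference, the mini-batch update described in the quoted issue can be
sketched as follows (Python, illustrative names; I am assuming the
per-center learning rate of Sculley's 2010 formulation, which the issue
does not spell out):

```python
import numpy as np

def mini_batch_kmeans_step(data, centers, counts, batch_size, rng):
    """One mini-batch update: sample a batch, assign each sampled point
    to its nearest center, then move each center toward its assigned
    points with a per-center learning rate of 1 / count."""
    batch = data[rng.choice(len(data), size=batch_size, replace=False)]
    # Nearest center for each sampled point (squared Euclidean).
    d2 = ((batch[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    for x, j in zip(batch, nearest):
        counts[j] += 1
        eta = 1.0 / counts[j]
        centers[j] = (1.0 - eta) * centers[j] + eta * x
    return centers, counts
```

Each iteration touches only `batch_size` points instead of the full
dataset, which is where the performance improvement comes from.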



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
