[ 
https://issues.apache.org/jira/browse/SPARK-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063601#comment-14063601
 ] 

RJ Nowling commented on SPARK-2308:
-----------------------------------

I tested KMeans vs. MiniBatch KMeans under two scenarios:

* 4 centers with 1000, 100, 10, and 1 data points, respectively
* 100 centers with 10 points each

The proposed centers were generated along a grid.  The data points were 
generated by adding samples from N(0, 1.0) in each dimension to the centers. I 
found the expected centers by averaging the points generated from each proposed 
center.
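
For anyone who wants to reproduce the setup, here is a rough sketch of how the 
two data sets could be generated. It is not the script I used. The grid 
spacing of 10.0, the 2 dimensions, the seed, and the name make_clusters are 
all assumptions made for illustration.

```python
import random

def make_clusters(centers, sizes, sigma=1.0, seed=0):
    """Draw sizes[i] points around centers[i] by adding N(0, sigma) noise per dimension.

    Returns (points, labels, expected_centers). All names here are hypothetical.
    """
    rng = random.Random(seed)
    points, labels = [], []
    for label, (center, n) in enumerate(zip(centers, sizes)):
        for _ in range(n):
            points.append([c + rng.gauss(0.0, sigma) for c in center])
            labels.append(label)

    # Expected center = mean of the points actually drawn from each proposed center.
    dim = len(centers[0])
    expected = []
    for label in range(len(centers)):
        members = [p for p, l in zip(points, labels) if l == label]
        expected.append([sum(p[d] for p in members) / len(members) for d in range(dim)])
    return points, labels, expected

# Scenario 1: four centers on a 2x2 grid with very uneven sizes.
# Grid spacing and dimensionality are assumed; the comment does not state them.
uneven_centers = [[0.0, 0.0], [0.0, 10.0], [10.0, 0.0], [10.0, 10.0]]
uneven_points, uneven_labels, uneven_expected = make_clusters(
    uneven_centers, [1000, 100, 10, 1])

# Scenario 2: 100 centers on a 10x10 grid, 10 points each.
grid_centers = [[10.0 * i, 10.0 * j] for i in range(10) for j in range(10)]
grid_points, grid_labels, grid_expected = make_clusters(grid_centers, [10] * 100)
```

Note that the expected center of the 1-point cluster is just that point, 
which is why it makes a hard target for either algorithm.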

I ran KMeans and MiniBatch KMeans for each set of data points with 30 
iterations and k-means++ initialization.

I plotted the expected centers (blue), KMeans centers (red), and MiniBatch 
centers (green).  The two methods showed similar results.  In the first 
scenario, both struggled with the small clusters and ended up placing two 
centers in the large cluster while ignoring the single data point.  For the 
100 even clusters, both methods got most of the centers reasonably correct, 
and in a few cases placed 2 centers where there should be 1.

I've attached the plots (many_small_centers.pdf, uneven_centers.pdf).

In reviewing the scikit-learn implementation, I saw that they handled small 
clusters as special cases.  In the case of small clusters, one of the points in 
the cluster is randomly chosen as the center instead of finding the center as a 
running average of the points sampled.


> Add KMeans MiniBatch clustering algorithm to MLlib
> --------------------------------------------------
>
>                 Key: SPARK-2308
>                 URL: https://issues.apache.org/jira/browse/SPARK-2308
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: RJ Nowling
>            Priority: Minor
>         Attachments: many_small_centers.pdf, uneven_centers.pdf
>
>
> Mini-batch is a version of KMeans that uses a randomly-sampled subset of the 
> data points in each iteration instead of the full set of data points, 
> improving performance (and in some cases, accuracy).  The mini-batch version 
> is compatible with the KMeans|| initialization algorithm currently 
> implemented in MLlib.
> I suggest adding KMeans Mini-batch as an alternative.
> I'd like this to be assigned to me.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
