[ 
https://issues.apache.org/jira/browse/SPARK-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14052558#comment-14052558
 ] 

RJ Nowling commented on SPARK-2308:
-----------------------------------

Hi Xiangrui,

Here's the paper:
http://www.ra.ethz.ch/CDstore/www2010/www/p1177.pdf

This discussion in the scikit-learn documentation could also be useful:
http://scikit-learn.org/stable/modules/clustering.html

I agree that smaller clusters will be at a disadvantage with uniform sampling.  
One could weight the points inversely by cluster size, or something similar.  
The challenge is doing so without touching all of the data points.  The 
mini-batch approach samples only batchSize data points in each iteration and 
uses them to update their respective centers.  To keep the weights from quickly 
becoming inaccurate, you would have to reassign all of the data points to the 
updated cluster centers in each iteration, which would defeat one of the main 
optimizations of the method.
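
To make the mini-batch update described above concrete, here is a rough 
sketch in plain Python.  It is not MLlib code and not a proposed 
implementation.  The function name, its parameters, and the 1/count 
per-center step size (which I believe follows the paper) are assumptions 
for illustration only.

```python
import random


def minibatch_kmeans(points, centers, batch_size, iterations, seed=0):
    """Illustrative mini-batch k-means; all names here are hypothetical."""
    rng = random.Random(seed)
    centers = [list(c) for c in centers]  # copy so the caller's list is untouched
    counts = [0] * len(centers)           # sampled points assigned to each center so far

    def nearest(p):
        # Index of the center with the smallest squared Euclidean distance to p.
        return min(
            range(len(centers)),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])),
        )

    for _ in range(iterations):
        # Uniform sample: only these batch_size points are touched this iteration.
        batch = rng.sample(points, batch_size)
        # Assign the whole batch against the centers as they stand right now.
        assigned = [nearest(p) for p in batch]
        for p, j in zip(batch, assigned):
            counts[j] += 1
            eta = 1.0 / counts[j]  # assumed per-center step size
            centers[j] = [(1 - eta) * c + eta * x for c, x in zip(centers[j], p)]
    return centers, counts
```

Note that counts only tracks the points that were sampled, not the true 
cluster sizes, which is why weighting by cluster size is hard to do without 
a full reassignment pass.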

Do you have any suggestions on how to achieve the weighting in a way that would 
maintain the properties necessary for convergence and keep the efficiency 
advantages?

Thanks!

> Add KMeans MiniBatch clustering algorithm to MLlib
> --------------------------------------------------
>
>                 Key: SPARK-2308
>                 URL: https://issues.apache.org/jira/browse/SPARK-2308
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: RJ Nowling
>            Priority: Minor
>
> Mini-batch is a version of KMeans that uses a randomly-sampled subset of the 
> data points in each iteration instead of the full set of data points, 
> improving performance (and in some cases, accuracy).  The mini-batch version 
> is compatible with the KMeans|| initialization algorithm currently 
> implemented in MLlib.
> I suggest adding KMeans Mini-batch as an alternative.
> I'd like this to be assigned to me.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
