[ https://issues.apache.org/jira/browse/SPARK-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14052558#comment-14052558 ]
RJ Nowling commented on SPARK-2308:
-----------------------------------

Hi Xiangrui,

Here's the paper: http://www.ra.ethz.ch/CDstore/www2010/www/p1177.pdf

The discussion in the scikit-learn documentation could also be useful: http://scikit-learn.org/stable/modules/clustering.html

I agree that smaller clusters will be at a disadvantage with uniform sampling. One could imagine weighting the points inversely by cluster size or the like, but the challenge would be to do that in a way that doesn't require touching all of the data points. The mini-batch approach samples only batchSize data points in each iteration, and only those sampled points are used to update their respective centers. To keep the weights from quickly becoming inaccurate, you would have to reassign all of the data points to the updated cluster centers in every iteration, which would defeat one of the main optimizations of the method. Do you have any suggestions on how to achieve the weighting in a way that maintains the properties necessary for convergence while keeping the efficiency advantages? Thanks!

> Add KMeans MiniBatch clustering algorithm to MLlib
> --------------------------------------------------
>
>             Key: SPARK-2308
>             URL: https://issues.apache.org/jira/browse/SPARK-2308
>         Project: Spark
>      Issue Type: New Feature
>      Components: MLlib
>        Reporter: RJ Nowling
>        Priority: Minor
>
> Mini-batch is a version of KMeans that uses a randomly-sampled subset of the
> data points in each iteration instead of the full set of data points,
> improving performance (and in some cases, accuracy). The mini-batch version
> is compatible with the KMeans|| initialization algorithm currently
> implemented in MLlib.
> I suggest adding KMeans Mini-batch as an alternative.
> I'd like this to be assigned to me.
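For concreteness, below is a minimal local sketch of the per-iteration mini-batch update from Sculley's WWW 2010 paper: sample a batch, assign each sampled point to its nearest center, then nudge only those centers using a per-center learning rate of 1/count. This is only an illustration of the update rule under discussion, not the proposed MLlib implementation; the object, method, and parameter names (MiniBatchKMeansSketch, run, batchSize, etc.) are made up for the example, and a Spark version would presumably draw each batch from an RDD (e.g., with takeSample) rather than a local collection.

{code:scala}
import scala.util.Random

object MiniBatchKMeansSketch {

  /** Squared Euclidean distance between two points. */
  def squaredDistance(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  /** Index of the center closest to the given point. */
  def closestCenter(centers: Array[Array[Double]], point: Array[Double]): Int =
    centers.indices.minBy(i => squaredDistance(centers(i), point))

  /**
   * Mini-batch k-means (Sculley, WWW 2010), local sketch.
   * counts(i) is the number of points assigned to center i so far,
   * which gives the per-center learning rate 1 / counts(i).
   */
  def run(points: IndexedSeq[Array[Double]],
          initialCenters: Array[Array[Double]],
          batchSize: Int,
          iterations: Int,
          seed: Long = 42L): Array[Array[Double]] = {
    val rng = new Random(seed)
    val centers = initialCenters.map(_.clone())
    val counts = Array.fill(centers.length)(0L)

    for (_ <- 0 until iterations) {
      // Sample a mini-batch uniformly at random (with replacement, for simplicity).
      val batch = Array.fill(batchSize)(points(rng.nextInt(points.length)))
      // Cache assignments for the batch, then update only the affected centers.
      val assignments = batch.map(p => closestCenter(centers, p))
      for ((p, c) <- batch.zip(assignments)) {
        counts(c) += 1
        val eta = 1.0 / counts(c) // per-center learning rate
        // Gradient step: move center c toward p by eta.
        var d = 0
        while (d < centers(c).length) {
          centers(c)(d) = (1.0 - eta) * centers(c)(d) + eta * p(d)
          d += 1
        }
      }
    }
    centers
  }
}
{code}

In an RDD-based version, each batch could be drawn with data.takeSample(withReplacement = false, batchSize) so the per-iteration cost stays proportional to batchSize rather than to the full dataset, which is the property the weighting scheme would need to preserve.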