[ https://issues.apache.org/jira/browse/SPARK-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14055770#comment-14055770 ]
Doris Xin commented on SPARK-2308:
----------------------------------

Hey guys,

Sorry to crash the party. I don't think small clusters are actually a problem since you're using a fixed sample size instead of a sampling rate. So for small clusters whose sizes are comparable to the batchSize, you'd have a sampling rate ~1.0, which means the entire cluster is picked up in the sample.

Alternatively, you can look into congressional sampling (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.100.1057&rep=rep1&type=pdf), where there's both a fixed size portion and a portion that's proportional to the cluster size in each sample.

> Add KMeans MiniBatch clustering algorithm to MLlib
> --------------------------------------------------
>
>                 Key: SPARK-2308
>                 URL: https://issues.apache.org/jira/browse/SPARK-2308
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: RJ Nowling
>            Priority: Minor
>
> Mini-batch is a version of KMeans that uses a randomly-sampled subset of the
> data points in each iteration instead of the full set of data points,
> improving performance (and in some cases, accuracy). The mini-batch version
> is compatible with the KMeans|| initialization algorithm currently
> implemented in MLlib.
> I suggest adding KMeans Mini-batch as an alternative.
> I'd like this to be assigned to me.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
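
To make the fixed-size argument in the comment concrete, here is a minimal Python sketch. It is not MLlib code, and both function names are hypothetical. It assumes one reading of the comment: up to `batch_size` points are drawn from a cluster, so the effective sampling rate is `batch_size / cluster_size`, capped at 1.0. The second function is a simplified illustration of the "fixed portion plus proportional portion" idea as described in the comment; it is not the exact allocation scheme from the congressional sampling paper.

```python
def effective_sampling_rate(cluster_size, batch_size):
    """Fraction of a cluster picked up when at most batch_size points are drawn.

    Assumption: the sample size is fixed, so the rate is derived from it
    rather than configured directly.
    """
    if cluster_size <= 0:
        raise ValueError("cluster_size must be positive")
    return min(1.0, batch_size / cluster_size)


def mixed_allocation(cluster_sizes, fixed_per_cluster, proportional_budget):
    """Per-cluster sample sizes: a fixed part plus a size-proportional part.

    Simplified sketch of the idea in the comment, not the paper's algorithm.
    Each allocation is capped at the cluster's own size.
    """
    total = sum(cluster_sizes)
    if total <= 0:
        raise ValueError("cluster_sizes must contain at least one point")
    allocation = []
    for size in cluster_sizes:
        # Integer share of the proportional budget, rounded down.
        proportional = proportional_budget * size // total
        allocation.append(min(size, fixed_per_cluster + proportional))
    return allocation
```

With `batch_size = 100`, a cluster of 100 points gets a rate of 1.0 (the whole cluster is sampled), while a cluster of 100,000 points gets a rate of 0.001. In the mixed scheme, the fixed part keeps small clusters represented and the proportional part lets large clusters contribute more points.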