[ https://issues.apache.org/jira/browse/SPARK-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063601#comment-14063601 ]
RJ Nowling commented on SPARK-2308:
-----------------------------------

I tested KMeans vs. MiniBatch KMeans under 2 scenarios:

* 4 centers with 1000, 100, 10, and 1 data points, respectively.
* 100 centers with 10 points each.

The proposed centers were generated along a grid. The data points were generated by adding samples from N(0, 1.0) in each dimension to the centers. I found the expected centers by averaging the points generated from each proposed center.

I ran KMeans and MiniBatch KMeans on each set of data points with 30 iterations and k-means++ initialization. I plotted the expected centers (blue), KMeans centers (red), and MiniBatch centers (green).

The two methods showed similar results. In the uneven scenario, they both struggled with the small clusters and ended up finding two centers for the large cluster, ignoring the single data point. For the 100 even clusters, both methods got most of the centers reasonably correct and, in a few cases, placed 2 centers where there should be 1.

I've attached the plots (many_small_centers.pdf, uneven_centers.pdf).

In reviewing the scikit-learn implementation, I saw that they handle small clusters as a special case: one of the points in the cluster is randomly chosen as the center instead of computing the center as a running average of the points sampled.

> Add KMeans MiniBatch clustering algorithm to MLlib
> --------------------------------------------------
>
>                 Key: SPARK-2308
>                 URL: https://issues.apache.org/jira/browse/SPARK-2308
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: RJ Nowling
>            Priority: Minor
>         Attachments: many_small_centers.pdf, uneven_centers.pdf
>
>
> Mini-batch is a version of KMeans that uses a randomly-sampled subset of the
> data points in each iteration instead of the full set of data points,
> improving performance (and in some cases, accuracy). The mini-batch version
> is compatible with the KMeans|| initialization algorithm currently
> implemented in MLlib.
> I suggest adding KMeans Mini-batch as an alternative.
> I'd like this to be assigned to me.

--
This message was sent by Atlassian JIRA (v6.2#6252)
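
For readers unfamiliar with the technique discussed in this thread, here is a minimal sketch of one mini-batch iteration using the running-average center update mentioned in the comment. It is not MLlib code and not the proposed patch: the function names (nearest, minibatch_step) are hypothetical, it uses only the Python standard library, and it assumes per-center counts start at zero. It does not include the small-cluster special case attributed to scikit-learn, and the demo uses plain random initialization rather than k-means++ or KMeans||.

```python
import random


def nearest(point, centers):
    """Return the index of the center closest to point (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centers[j])),
    )


def minibatch_step(points, centers, counts, batch_size, rng):
    """Run one mini-batch iteration, updating centers and counts in place.

    Hypothetical helper for illustration only.
    """
    # Sample a random subset of the data instead of using every point.
    batch = rng.sample(points, min(batch_size, len(points)))
    # Assign each sampled point using the centers as they were at the start of the step.
    assignments = [nearest(x, centers) for x in batch]
    for x, j in zip(batch, assignments):
        counts[j] += 1
        eta = 1.0 / counts[j]  # per-center learning rate shrinks as the center absorbs points
        # Running average: with counts starting at 0, the center equals the mean
        # of all points it has absorbed so far.
        centers[j] = [(1.0 - eta) * c + eta * xi for c, xi in zip(centers[j], x)]
    return centers, counts


if __name__ == "__main__":
    rng = random.Random(0)
    # Two synthetic clusters with N(0, 1) noise in each dimension (assumed setup).
    points = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(200)]
    points += [[10.0 + rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(200)]
    centers = [list(p) for p in rng.sample(points, 2)]  # plain random init
    counts = [0, 0]
    for _ in range(30):
        minibatch_step(points, centers, counts, batch_size=20, rng=rng)
    print(centers)
```

Because a center only moves when a sampled point is assigned to it, a cluster with very few points is rarely sampled and its center may barely update, which is consistent with the difficulty on small clusters reported above.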