[ https://issues.apache.org/jira/browse/SPARK-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14078078#comment-14078078 ]
RJ Nowling commented on SPARK-2308:
-----------------------------------

I did all of my tests with scikit-learn, given your suggestion. Note that scikit-learn uses k-means++, not k-means||; I should have made that clear earlier. I'm not clear on what more you're looking for. I have a few observations at this point:

1. KMeans seems to be very sensitive to initialization -- cluster positions don't seem to change significantly after initialization.
2. Initialization seems to matter more than the choice between KMeans and KMeans MiniBatch -- given the same initialization, the two tend to do equally well (a comparison sketch is appended at the end of this message).
3. Both random and kmeans++ / kmeans|| initialization seem sensitive to variation in cluster sizes.

I'm happy to run more tests if you think they would be useful, but at this point I feel the behavior we're seeing is expected. Hierarchical KMeans, or methods such as KCenters that guarantee the space is partitioned evenly (regardless of cluster density), may be useful for cases where KMeans doesn't perform as desired.

> Add KMeans MiniBatch clustering algorithm to MLlib
> --------------------------------------------------
>
>                  Key: SPARK-2308
>                  URL: https://issues.apache.org/jira/browse/SPARK-2308
>              Project: Spark
>           Issue Type: New Feature
>           Components: MLlib
>             Reporter: RJ Nowling
>             Priority: Minor
>          Attachments: many_small_centers.pdf, uneven_centers.pdf
>
> Mini-batch is a version of KMeans that uses a randomly-sampled subset of the
> data points in each iteration instead of the full set of data points,
> improving performance (and in some cases, accuracy). The mini-batch version
> is compatible with the KMeans|| initialization algorithm currently
> implemented in MLlib.
> I suggest adding KMeans Mini-batch as an alternative.
> I'd like this to be assigned to me.
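For readers of the quoted description: an illustrative NumPy sketch of the mini-batch update itself, in the style of Sculley's web-scale k-means (per-center learning rates of 1/count). This is a toy sketch for intuition only, not the proposed MLlib implementation; the function name and parameters are mine:

    # Illustrative sketch of the mini-batch k-means update described in the
    # quoted issue -- toy code, not the proposed MLlib implementation.
    # Each iteration samples a small random batch, assigns its points to the
    # nearest center, and nudges each center toward its points with a
    # per-center step of 1/count, so centers settle as they absorb points.
    import numpy as np

    def minibatch_kmeans(X, init_centers, batch_size=100, n_iters=100, seed=0):
        rng = np.random.RandomState(seed)
        centers = init_centers.astype(float).copy()
        counts = np.zeros(len(centers))
        for _ in range(n_iters):
            batch = X[rng.choice(len(X), size=batch_size, replace=False)]
            # Squared distances from every batch point to every center.
            dists = ((batch[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            nearest = dists.argmin(axis=1)
            # Per-point gradient step; the step size shrinks as a center's
            # count grows, which is what stabilizes the mini-batch variant.
            for x, j in zip(batch, nearest):
                counts[j] += 1.0
                centers[j] += (x - centers[j]) / counts[j]
        return centers

Any initialization scheme (random, k-means++, or KMeans||) just supplies init_centers, which is why the mini-batch variant is compatible with the KMeans|| initialization already in MLlib.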
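And to make observation 2 concrete, here is a minimal scikit-learn sketch of the kind of comparison I ran. The data and parameters are assumed for illustration (synthetic blobs, k=5), not the exact setup behind the attached PDFs; both estimators receive the same explicit initial centers, so only the update procedure differs between the two fits:

    # Compare KMeans and MiniBatchKMeans under one shared initialization.
    import numpy as np
    from sklearn.cluster import KMeans, MiniBatchKMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=10000, centers=5, cluster_std=1.5,
                      random_state=42)

    # Fix one shared initialization: k random points drawn from the data.
    rng = np.random.RandomState(0)
    init_centers = X[rng.choice(len(X), size=5, replace=False)]

    full = KMeans(n_clusters=5, init=init_centers, n_init=1).fit(X)
    mini = MiniBatchKMeans(n_clusters=5, init=init_centers, n_init=1,
                           batch_size=100).fit(X)

    print("full KMeans inertia:      ", full.inertia_)
    print("mini-batch KMeans inertia:", mini.inertia_)

With the initialization held fixed like this, the two inertias typically land close together, while re-seeding init_centers tends to move both results around far more than switching estimators does -- which is why I read initialization as the dominant factor.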