[ https://issues.apache.org/jira/browse/SPARK-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14078078#comment-14078078 ]
RJ Nowling commented on SPARK-2308:
-----------------------------------

I did all of my tests with scikit-learn, given your suggestion. Note that scikit-learn uses k-means++, not k-means||; I should have made that clear earlier. I'm not clear on what more you're looking for. I have a few observations at this point:

1. KMeans seems to be very sensitive to initialization -- cluster positions don't seem to change significantly after initialization.
2. Initialization seems to matter more than the choice between KMeans and KMeans MiniBatch -- given the same initialization, the two tend to do equally well (a comparison sketch is appended at the end of this message).
3. Both random and kmeans++ / kmeans|| initialization seem sensitive to variation in cluster sizes.

I'm happy to run more tests if you think they would be useful, but at this point I feel the behavior we're seeing is expected. Hierarchical KMeans, or methods such as KCenters that guarantee the space is partitioned evenly (regardless of cluster density), may be useful for cases where KMeans doesn't perform as desired.

> Add KMeans MiniBatch clustering algorithm to MLlib
> --------------------------------------------------
>
>                  Key: SPARK-2308
>                  URL: https://issues.apache.org/jira/browse/SPARK-2308
>              Project: Spark
>           Issue Type: New Feature
>           Components: MLlib
>             Reporter: RJ Nowling
>             Priority: Minor
>          Attachments: many_small_centers.pdf, uneven_centers.pdf
>
> Mini-batch is a version of KMeans that uses a randomly-sampled subset of the
> data points in each iteration instead of the full set of data points,
> improving performance (and in some cases, accuracy). The mini-batch version
> is compatible with the KMeans|| initialization algorithm currently
> implemented in MLlib.
> I suggest adding KMeans Mini-batch as an alternative.
> I'd like this to be assigned to me.
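For readers of the quoted description: an illustrative NumPy sketch of the mini-batch update itself, in the style of Sculley's web-scale k-means (per-center learning rates of 1/count). This is a toy sketch for intuition only, not the proposed MLlib implementation; the function name and parameters are mine:

    # Illustrative sketch of the mini-batch k-means update described in the
    # quoted issue -- toy code, not the proposed MLlib implementation.
    # Each iteration samples a small random batch, assigns its points to the
    # nearest center, and nudges each center toward its points with a
    # per-center step of 1/count, so centers settle as they absorb points.
    import numpy as np

    def minibatch_kmeans(X, init_centers, batch_size=100, n_iters=100, seed=0):
        rng = np.random.RandomState(seed)
        centers = init_centers.astype(float).copy()
        counts = np.zeros(len(centers))
        for _ in range(n_iters):
            batch = X[rng.choice(len(X), size=batch_size, replace=False)]
            # Squared distances from every batch point to every center.
            dists = ((batch[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            nearest = dists.argmin(axis=1)
            # Per-point gradient step; the step size shrinks as a center's
            # count grows, which is what stabilizes the mini-batch variant.
            for x, j in zip(batch, nearest):
                counts[j] += 1.0
                centers[j] += (x - centers[j]) / counts[j]
        return centers

Any initialization scheme (random, k-means++, or KMeans||) just supplies init_centers, which is why the mini-batch variant is compatible with the KMeans|| initialization already in MLlib.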
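And to make observation 2 concrete, here is a minimal scikit-learn sketch of the kind of comparison I ran. The data and parameters are assumed for illustration (synthetic blobs, k=5), not the exact setup behind the attached PDFs; both estimators receive the same explicit initial centers, so only the update procedure differs between the two fits:

    # Compare KMeans and MiniBatchKMeans under one shared initialization.
    import numpy as np
    from sklearn.cluster import KMeans, MiniBatchKMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=10000, centers=5, cluster_std=1.5,
                      random_state=42)

    # Fix one shared initialization: k random points drawn from the data.
    rng = np.random.RandomState(0)
    init_centers = X[rng.choice(len(X), size=5, replace=False)]

    full = KMeans(n_clusters=5, init=init_centers, n_init=1).fit(X)
    mini = MiniBatchKMeans(n_clusters=5, init=init_centers, n_init=1,
                           batch_size=100).fit(X)

    print("full KMeans inertia:      ", full.inertia_)
    print("mini-batch KMeans inertia:", mini.inertia_)

With the initialization held fixed like this, the two inertias typically land close together, while re-seeding init_centers tends to move both results around far more than switching estimators does -- which is why I read initialization as the dominant factor.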