[jira] [Commented] (SPARK-8402) Add DP means clustering to MLlib

Nick Pentreath (JIRA) Thu, 11 May 2017 01:10:32 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-8402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16006060#comment-16006060
 ]


Nick Pentreath commented on SPARK-8402:
---------------------------------------

I'm afraid I would say there is not sufficient demand or committer bandwidth 
for this feature at present.

> Add DP means clustering to MLlib
> --------------------------------
>
>                 Key: SPARK-8402
>                 URL: https://issues.apache.org/jira/browse/SPARK-8402
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Meethu Mathew
>            Assignee: Meethu Mathew
>              Labels: features
>
> At present, all the clustering algorithms in MLlib require the number of 
> clusters to be specified in advance. 
> The Dirichlet process (DP) is a popular non-parametric Bayesian mixture model 
> that allows for flexible clustering of data without having to specify apriori 
> the number of clusters. 
> DP means is a non-parametric clustering algorithm that uses a scale parameter 
> 'lambda' to control the creation of new clusters ["Revisiting k-means: New 
> Algorithms via Bayesian Nonparametrics" by Brian Kulis, Michael I. Jordan].
> We have followed the distributed implementation of DP means which has been 
> proposed in the paper titled "MLbase: Distributed Machine Learning Made Easy" 
> by Xinghao Pan, Evan R. Sparks, Andre Wibisono.
> A benchmark comparison between k-means and dp-means based on Normalized 
> Mutual Information between ground truth clusters and algorithm outputs, have 
> been provided in the following table. It can be seen from the table that 
> DP-means reported a higher NMI on 5 of 8 data sets in comparison to 
> k-means[Source: Kulis, B., Jordan, M.I.: Revisiting k-means: New algorithms 
> via Bayesian nonparametrics (2011) Arxiv:1111.0352. (Table 1)]
> | Dataset       | DP-means | k-means |
> | Wine          | .41      | .43     |
> | Iris          | .75      | .76     |
> | Pima          | .02      | .03     |
> | Soybean       | .72      | .66     |
> | Car           | .07      | .05     |
> | Balance Scale | .17      | .11     |
> | Breast Cancer | .04      | .03     |
> | Vehicle       | .18      | .18     |
> Experiment on our spark cluster setup:
> An initial benchmark study was performed on a 3 node Spark cluster setup on 
> mesos where each node config was 8 Cores, 64 GB RAM and the spark version 
> used was 1.5(git branch).
> Tests were done using a mixture of 10 Gaussians with varying number of 
> features and instances. The results from the benchmark study are provided 
> below. The reported stats are average over 5 runs. 
> | DATASET     |            |         DPMEANS         |       |                
>          | KMEANS (k =10) |                         |
> | Instances   | Dimensions | No of clusters obtained | Time  | Converged in 
> iterations |      Time      | Converged in iterations |
> |  10 million |     10     |            10           | 43.6s |            2   
>          |      52.2s     |            2            |
> |  1 million  |     100    |            10           | 39.8s |            2   
>          |     43.39s     |            2            |
> | 0.1 million |    1000    |            10           | 37.3s |            2   
>          |     41.64s     |            2            |



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-8402) Add DP means clustering to MLlib

Reply via email to