[ 
https://issues.apache.org/jira/browse/SPARK-8402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Meethu Mathew updated SPARK-8402:
---------------------------------
    Description: 
At present, all the clustering algorithms in MLlib require the number of 
clusters to be specified in advance. 
The Dirichlet process (DP) is a popular non-parametric Bayesian mixture model 
that allows for flexible clustering of data without having to specify apriori 
the number of clusters. 
DP means is a non-parametric clustering algorithm that uses a scale parameter 
'lambda' to control the creation of new clusters ["Revisiting k-means: New 
Algorithms via Bayesian Nonparametrics" by Brian Kulis, Michael I. Jordan].

We have followed the distributed implementation of DP means which has been 
proposed in the paper titled "MLbase: Distributed Machine Learning Made Easy" 
by Xinghao Pan, Evan R. Sparks, Andre Wibisono.

A benchmark comparison between k-means and dp-means based on Normalized Mutual 
Information between ground truth clusters and algorithm outputs, have been 
provided in the following table. It can be seen from the table that DP-means 
reported a higher NMI on 5 of 8 data sets in comparison to k-means[Source: 
Kulis, B., Jordan, M.I.: Revisiting k-means: New algorithms via Bayesian 
nonparametrics (2011) Arxiv:1111.0352. (Table 1)]

| Dataset       | DP-means | k-means |
| Wine          | .41      | .43     |
| Iris          | .75      | .76     |
| Pima          | .02      | .03     |
| Soybean       | .72      | .66     |
| Car           | .07      | .05     |
| Balance Scale | .17      | .11     |
| Breast Cancer | .04      | .03     |
| Vehicle       | .18      | .18     |

Experiment on our spark cluster setup:

An initial benchmark study was performed on a 3 node Spark cluster setup on 
mesos where each node config was 8 Cores, 64 GB RAM and the spark version used 
was 1.5(git branch).

Tests were done using a mixture of 10 Gaussians with varying number of features 
and instances. The results from the benchmark study are provided below. The 
reported stats are average over 5 runs. 

| DATASET     |            |         DPMEANS         |       |                  
       | KMEANS (k =10) |                         |
| Instances   | Dimensions | No of clusters obtained | Time  | Converged in 
iterations |      Time      | Converged in iterations |
|  10 million |     10     |            10           | 43.6s |            2     
       |      52.2s     |            2            |
|  1 million  |     100    |            10           | 39.8s |            2     
       |     43.39s     |            2            |
| 0.1 million |    1000    |            10           | 37.3s |            2     
       |     41.64s     |            2            |

  was:
At present, all the clustering algorithms in MLlib require the number of 
clusters to be specified in advance. 
The Dirichlet process (DP) is a popular non-parametric Bayesian mixture model 
that allows for flexible clustering of data without having to specify apriori 
the number of clusters. 
DP means is a non-parametric clustering algorithm that uses a scale parameter 
'lambda' to control the creation of new clusters["Revisiting k-means: New 
Algorithms via Bayesian Nonparametrics" by Brian Kulis, Michael I. Jordan].

We have followed the distributed implementation of DP means which has been 
proposed in the paper titled "MLbase: Distributed Machine Learning Made Easy" 
by Xinghao Pan, Evan R. Sparks, Andre Wibisono.


> DP means clustering 
> --------------------
>
>                 Key: SPARK-8402
>                 URL: https://issues.apache.org/jira/browse/SPARK-8402
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Meethu Mathew
>            Assignee: Meethu Mathew
>              Labels: features
>
> At present, all the clustering algorithms in MLlib require the number of 
> clusters to be specified in advance. 
> The Dirichlet process (DP) is a popular non-parametric Bayesian mixture model 
> that allows for flexible clustering of data without having to specify apriori 
> the number of clusters. 
> DP means is a non-parametric clustering algorithm that uses a scale parameter 
> 'lambda' to control the creation of new clusters ["Revisiting k-means: New 
> Algorithms via Bayesian Nonparametrics" by Brian Kulis, Michael I. Jordan].
> We have followed the distributed implementation of DP means which has been 
> proposed in the paper titled "MLbase: Distributed Machine Learning Made Easy" 
> by Xinghao Pan, Evan R. Sparks, Andre Wibisono.
> A benchmark comparison between k-means and dp-means based on Normalized 
> Mutual Information between ground truth clusters and algorithm outputs, have 
> been provided in the following table. It can be seen from the table that 
> DP-means reported a higher NMI on 5 of 8 data sets in comparison to 
> k-means[Source: Kulis, B., Jordan, M.I.: Revisiting k-means: New algorithms 
> via Bayesian nonparametrics (2011) Arxiv:1111.0352. (Table 1)]
> | Dataset       | DP-means | k-means |
> | Wine          | .41      | .43     |
> | Iris          | .75      | .76     |
> | Pima          | .02      | .03     |
> | Soybean       | .72      | .66     |
> | Car           | .07      | .05     |
> | Balance Scale | .17      | .11     |
> | Breast Cancer | .04      | .03     |
> | Vehicle       | .18      | .18     |
> Experiment on our spark cluster setup:
> An initial benchmark study was performed on a 3 node Spark cluster setup on 
> mesos where each node config was 8 Cores, 64 GB RAM and the spark version 
> used was 1.5(git branch).
> Tests were done using a mixture of 10 Gaussians with varying number of 
> features and instances. The results from the benchmark study are provided 
> below. The reported stats are average over 5 runs. 
> | DATASET     |            |         DPMEANS         |       |                
>          | KMEANS (k =10) |                         |
> | Instances   | Dimensions | No of clusters obtained | Time  | Converged in 
> iterations |      Time      | Converged in iterations |
> |  10 million |     10     |            10           | 43.6s |            2   
>          |      52.2s     |            2            |
> |  1 million  |     100    |            10           | 39.8s |            2   
>          |     43.39s     |            2            |
> | 0.1 million |    1000    |            10           | 37.3s |            2   
>          |     41.64s     |            2            |



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to