[ https://issues.apache.org/jira/browse/SPARK-8402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16006060#comment-16006060 ]
Nick Pentreath commented on SPARK-8402: --------------------------------------- I'm afraid I would say there is not sufficient demand or committer bandwidth for this feature at present. > Add DP means clustering to MLlib > -------------------------------- > > Key: SPARK-8402 > URL: https://issues.apache.org/jira/browse/SPARK-8402 > Project: Spark > Issue Type: New Feature > Components: MLlib > Reporter: Meethu Mathew > Assignee: Meethu Mathew > Labels: features > > At present, all the clustering algorithms in MLlib require the number of > clusters to be specified in advance. > The Dirichlet process (DP) is a popular non-parametric Bayesian mixture model > that allows for flexible clustering of data without having to specify apriori > the number of clusters. > DP means is a non-parametric clustering algorithm that uses a scale parameter > 'lambda' to control the creation of new clusters ["Revisiting k-means: New > Algorithms via Bayesian Nonparametrics" by Brian Kulis, Michael I. Jordan]. > We have followed the distributed implementation of DP means which has been > proposed in the paper titled "MLbase: Distributed Machine Learning Made Easy" > by Xinghao Pan, Evan R. Sparks, Andre Wibisono. > A benchmark comparison between k-means and dp-means based on Normalized > Mutual Information between ground truth clusters and algorithm outputs, have > been provided in the following table. It can be seen from the table that > DP-means reported a higher NMI on 5 of 8 data sets in comparison to > k-means[Source: Kulis, B., Jordan, M.I.: Revisiting k-means: New algorithms > via Bayesian nonparametrics (2011) Arxiv:1111.0352. (Table 1)] > | Dataset | DP-means | k-means | > | Wine | .41 | .43 | > | Iris | .75 | .76 | > | Pima | .02 | .03 | > | Soybean | .72 | .66 | > | Car | .07 | .05 | > | Balance Scale | .17 | .11 | > | Breast Cancer | .04 | .03 | > | Vehicle | .18 | .18 | > Experiment on our spark cluster setup: > An initial benchmark study was performed on a 3 node Spark cluster setup on > mesos where each node config was 8 Cores, 64 GB RAM and the spark version > used was 1.5(git branch). > Tests were done using a mixture of 10 Gaussians with varying number of > features and instances. The results from the benchmark study are provided > below. The reported stats are average over 5 runs. > | DATASET | | DPMEANS | | > | KMEANS (k =10) | | > | Instances | Dimensions | No of clusters obtained | Time | Converged in > iterations | Time | Converged in iterations | > | 10 million | 10 | 10 | 43.6s | 2 > | 52.2s | 2 | > | 1 million | 100 | 10 | 39.8s | 2 > | 43.39s | 2 | > | 0.1 million | 1000 | 10 | 37.3s | 2 > | 41.64s | 2 | -- This message was sent by Atlassian JIRA (v6.3.15#6346) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org