[ https://issues.apache.org/jira/browse/SPARK-8402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14591009#comment-14591009 ]
Joseph K. Bradley commented on SPARK-8402:
------------------------------------------

Feel free to go ahead and work on it. I have not heard of too many users needing this feature, however, so it might be worth polling for interest/need (e.g., on the dev list). Often, people just try several numbers of means and pick the smallest that gives decent results on their data. But it's definitely worth considering. Thanks!

Btw, I'll remove the target version. A committer should set that, since it's meant to be a commitment to get a feature in for a particular release.

> DP means clustering
> --------------------
>
>                 Key: SPARK-8402
>                 URL: https://issues.apache.org/jira/browse/SPARK-8402
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Meethu Mathew
>              Labels: features
>
> At present, all the clustering algorithms in MLlib require the number of
> clusters to be specified in advance.
> The Dirichlet process (DP) is a popular non-parametric Bayesian mixture model
> that allows for flexible clustering of data without having to specify a priori
> the number of clusters.
> DP means is a non-parametric clustering algorithm that uses a scale parameter
> 'lambda' to control the creation of new clusters ["Revisiting k-means: New
> Algorithms via Bayesian Nonparametrics" by Brian Kulis, Michael I. Jordan].
> We have followed the distributed implementation of DP means which has been
> proposed in the paper titled "MLbase: Distributed Machine Learning Made Easy"
> by Xinghao Pan, Evan R. Sparks, Andre Wibisono.
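
For readers who have not seen the algorithm named in the issue, here is a minimal single-machine sketch of DP-means as I understand it from the Kulis and Jordan paper cited above: a point opens a new cluster whenever its squared distance to every existing mean exceeds 'lambda'. This is not the distributed implementation the reporter refers to, and it is not an MLlib API. The names dp_means, lam and max_iter are illustrative, and dropping empty clusters after each pass is an assumption of this sketch.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_of(points: List[Point]) -> List[float]:
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def dp_means(
    data: List[Point], lam: float, max_iter: int = 100
) -> Tuple[List[List[float]], List[int]]:
    """Serial DP-means sketch: returns (cluster means, per-point labels)."""
    # Start with a single cluster centred on the global mean.
    centers = [mean_of(data)]
    assignments = [0] * len(data)

    for _ in range(max_iter):
        changed = False
        for i, x in enumerate(data):
            dists = [squared_distance(x, c) for c in centers]
            best = min(range(len(centers)), key=dists.__getitem__)
            # The scale parameter decides whether x opens a new cluster.
            if dists[best] > lam:
                centers.append(list(x))
                best = len(centers) - 1
            if assignments[i] != best:
                assignments[i] = best
                changed = True

        # Recompute means; drop clusters that lost all their points
        # (an assumption of this sketch) and renumber the labels.
        members: List[List[Point]] = [[] for _ in centers]
        for x, k in zip(data, assignments):
            members[k].append(x)
        kept = [k for k, m in enumerate(members) if m]
        remap = {old: new for new, old in enumerate(kept)}
        centers = [mean_of(members[k]) for k in kept]
        assignments = [remap[k] for k in assignments]

        if not changed:
            break

    return centers, assignments


if __name__ == "__main__":
    points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    means, labels = dp_means(points, lam=9.0)
    print(len(means), labels)
```

Note that 'lambda' is compared against squared distances here, so it plays the role that choosing k plays in k-means: a larger value yields fewer clusters, a smaller value yields more. The result can also depend on the order in which points are visited.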