Hi all,

Cool discussion! I agree that a more standardized API for clustering, and easy access to the underlying routines, would be useful (we've also been discussing this while trying to develop streaming clustering algorithms, similar to https://github.com/apache/spark/pull/1361).
For divisive, hierarchical clustering I implemented something a while back; here's a gist: https://gist.github.com/freeman-lab/5947e7c53b368fe90371

It does bisecting k-means clustering (with k=2), with a recursive class for keeping track of the tree. I also found this much better than agglomerative methods (for the reasons Hector points out). It needs to be cleaned up and can surely be optimized (especially by replacing the core k-means step with existing MLlib code), but I can say I was running it successfully on quite large data sets.

RJ, depending on where you are in your progress, I'd be happy to help work on this piece and/or have you use it as a jumping-off point, if useful.

-- Jeremy

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Contributing-to-MLlib-Proposal-for-Clustering-Algorithms-tp7212p7398.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
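P.S. For anyone skimming, here's a rough sketch of the idea described above — bisecting k-means with k=2 plus a recursive node class for the tree. This is plain single-machine Python for illustration only, not the gist's Spark code; all names (`kmeans2`, `Node`, `bisect`, `leaves`) are made up for this sketch.

```python
import random

def kmeans2(points, iters=20, seed=0):
    """Split points into two clusters with basic Lloyd's k-means, k=2."""
    rng = random.Random(seed)
    centers = rng.sample(points, 2)
    for _ in range(iters):
        clusters = ([], [])
        for p in points:
            # Assign each point to the nearer of the two centers.
            d0 = sum((a - b) ** 2 for a, b in zip(p, centers[0]))
            d1 = sum((a - b) ** 2 for a, b in zip(p, centers[1]))
            clusters[0 if d0 <= d1 else 1].append(p)
        # Recompute each center as the mean of its cluster.
        new = []
        for c, old in zip(clusters, centers):
            if c:
                new.append(tuple(sum(x) / len(c) for x in zip(*c)))
            else:
                new.append(old)  # keep an empty cluster's center in place
        if new == centers:
            break
        centers = new
    return clusters

class Node:
    """A node in the cluster tree: a leaf holds points; an inner node has two children."""
    def __init__(self, points):
        self.points = points
        self.left = self.right = None

def bisect(node, max_depth):
    """Recursively split a node's points in two until max_depth is reached."""
    if max_depth == 0 or len(node.points) < 2:
        return
    left, right = kmeans2(node.points)
    if not left or not right:
        return  # degenerate split; leave this node as a leaf
    node.left, node.right = Node(left), Node(right)
    bisect(node.left, max_depth - 1)
    bisect(node.right, max_depth - 1)

def leaves(node):
    """Collect the point sets at the leaves of the tree."""
    if node.left is None:
        return [node.points]
    return leaves(node.left) + leaves(node.right)
```

On two well-separated blobs, `bisect(root, 1)` splits them apart at the first level; deeper `max_depth` keeps subdividing, which is the divisive counterpart to building the tree bottom-up agglomeratively.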