[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213942#comment-14213942 ]
Jun Yang commented on SPARK-2429:
---------------------------------

Hi Yu Ishikawa,

Thanks for your wonderful hierarchical implementation of KMeans, which meets one of our project requirements :)

In our project, we initially used an MPI-based HAC implementation to do agglomerative (bottom-up) hierarchical clustering. Since we want to migrate the entire back-end pipeline to Spark, we have been looking for a comparable hierarchical clustering implementation on Spark; otherwise we would need to write one ourselves.

From a functionality perspective, your implementation looks pretty good (I have already read through your code), but I still have several questions regarding performance and scalability:

1. In your implementation, each divisive step performs a "copy" operation to distribute the data nodes in the parent cluster tree to the split children cluster trees. When the document size is large, I think this copy cost is non-negligible, right? A potential optimization is to keep the entire document data cached and, in each divisive step, record only the indices of the documents in the ClusterTree object, so the cost could be lowered quite a lot. Does this idea make sense?

2. In your test code, the number of clusters is not very large (only about 100). Have you ever tested it with a large number of clusters and a big document corpus, e.g., 10000 clusters with 2000000 documents? How does the performance behave in this kind of use case? In a production environment, this use case is quite typical.

Look forward to your reply.
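As a rough illustration of the index-based idea in question 1, here is a minimal single-machine Python sketch. It is not the code under review: `ClusterNode`, `bisect`, and the toy 2-means split are hypothetical stand-ins, and the shared `data` list only plays the role that the cached document data would play on Spark. The point is that each split hands the children lists of row indices, while the rows themselves are never copied.

```python
import random


class ClusterNode:
    """Hypothetical tree node: holds row indices into a shared dataset, not the rows."""

    def __init__(self, indices):
        self.indices = indices
        self.children = []


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(data, indices):
    # Centroid of the rows referenced by `indices`.
    dim = len(data[0])
    return [sum(data[i][d] for i in indices) / len(indices) for d in range(dim)]


def bisect(data, node, iterations=10, seed=0):
    """Split `node` in two with a plain 2-means pass over its indices only."""
    rng = random.Random(seed)
    centers = [data[i] for i in rng.sample(node.indices, 2)]
    left, right = [], []
    for _ in range(iterations):
        left, right = [], []
        for i in node.indices:
            d0 = squared_distance(data[i], centers[0])
            d1 = squared_distance(data[i], centers[1])
            (left if d0 <= d1 else right).append(i)
        if not left or not right:
            break  # degenerate split; a real implementation would re-seed
        centers = [mean(data, left), mean(data, right)]
    # Children receive index lists; `data` itself is shared, never copied.
    node.children = [ClusterNode(left), ClusterNode(right)]
    return node.children


# Toy usage: two well-separated groups of three points each.
data = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
        [10.0, 10.0], [10.1, 10.0], [10.0, 10.1]]
root = ClusterNode(list(range(len(data))))
children = bisect(data, root)
```

With this layout, the per-split cost is proportional to the number of indices moved rather than to the size of the documents they refer to.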
Thanks

> Hierarchical Implementation of KMeans
> -------------------------------------
>
>                 Key: SPARK-2429
>                 URL: https://issues.apache.org/jira/browse/SPARK-2429
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: RJ Nowling
>            Assignee: Yu Ishikawa
>            Priority: Minor
>         Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf,
>                      The Result of Benchmarking a Hierarchical Clustering.pdf,
>                      benchmark-result.2014-10-29.html, benchmark2.html
>
> Hierarchical clustering algorithms are widely used and would make a nice
> addition to MLlib. Clustering algorithms are useful for determining
> relationships between clusters as well as offering faster assignment.
> Discussion on the dev list suggested the following possible approaches:
> * Top down, recursive application of KMeans
> * Reuse DecisionTree implementation with different objective function
> * Hierarchical SVD
> It was also suggested that support for distance metrics other than Euclidean
> such as negative dot or cosine are necessary.