[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179974#comment-14179974 ]
Yu Ishikawa commented on SPARK-2429:
------------------------------------

{quote}
Can you add a breakdown of the timings for each part of the algorithm? (e.g, like you did to find out which parts were slowest?) You don't need to do a sweep over multiple data sizes or number of data points – just pick a representative number of data point and rows.
{quote}

Yes, I can. So far I have only embedded debug code in each part of the algorithm; it does not emit log messages yet. I can add logging for the execution time of each part. Or would you prefer to get the timings as variables, like `HierarchicalClusteringModel.trainTime`?

{quote}
Have you compared the performance of the hierarchical KMeans vs KMeans implemented in MLLib? I expect that the hierarchical will be slower to cluster but the assignment should be faster (O(log k) vs O(k)). This improvement in assignment speed is the motivation for including the hierarchical KMeans in Spark.
{quote}

No, I haven't yet. I agree that hierarchical clustering will be slower than k-means. The assignment in `HierarchicalClusteringModel.predict` is currently as fast as that of k-means. We should improve the `ClusterTree` data structure, for example along the lines of a B-tree, so that it can be searched by index. Do you have any good ideas for improving it?

There is one more thing I would like to discuss. I know we should support distance metrics other than Euclidean, such as cosine distance, and I want to. However, I don't yet know the best way to support distance functions in MLlib as a common library. So I have a suggestion about the release of the algorithm: first, we merge it with only the Euclidean distance metric, and then we add support for other distance metrics in a separate issue. What do you think?
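
To make the O(log k) vs O(k) assignment point above concrete, here is a minimal sketch of the two strategies on a binary cluster tree. It is only an illustration under stated assumptions: `TreeAssignmentSketch`, `Node`, `Leaf`, `Branch`, `assignFlat`, and `assignByDescent` are hypothetical names and are not taken from the patch, from `ClusterTree`, or from `HierarchicalClusteringModel.predict`. The distance is hard-coded to squared Euclidean, in line with the Euclidean-only first release suggested above.

```scala
// Hypothetical sketch: none of these names come from the SPARK-2429 patch.
object TreeAssignmentSketch {
  type Point = Array[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def squaredDistance(a: Point, b: Point): Double = {
    require(a.length == b.length, "dimension mismatch")
    var sum = 0.0
    var i = 0
    while (i < a.length) {
      val d = a(i) - b(i)
      sum += d * d
      i += 1
    }
    sum
  }

  // A node of a binary cluster tree; leaves carry the final cluster index.
  sealed trait Node { def center: Point }
  final case class Leaf(center: Point, clusterIndex: Int) extends Node
  final case class Branch(center: Point, left: Node, right: Node) extends Node

  // Flat assignment: compares the point with every leaf center, O(k).
  def assignFlat(leaves: Seq[Leaf], p: Point): Int =
    leaves.minBy(leaf => squaredDistance(leaf.center, p)).clusterIndex

  // Tree assignment: at each branch, descend into the child whose center
  // is closer. The cost is proportional to the tree depth, which is
  // O(log k) only if the tree is reasonably balanced.
  @annotation.tailrec
  def assignByDescent(node: Node, p: Point): Int = node match {
    case Leaf(_, index) => index
    case Branch(_, left, right) =>
      val next =
        if (squaredDistance(left.center, p) <= squaredDistance(right.center, p)) left
        else right
      assignByDescent(next, p)
  }
}
```

Note that the greedy descent is not guaranteed to return the same cluster as the flat search over all leaves, and a heavily unbalanced tree can degrade towards O(k), so a real comparison would need to measure both speed and agreement.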

Thanks

> Hierarchical Implementation of KMeans
> -------------------------------------
>
> Key: SPARK-2429
> URL: https://issues.apache.org/jira/browse/SPARK-2429
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Reporter: RJ Nowling
> Assignee: Yu Ishikawa
> Priority: Minor
> Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The Result of Benchmarking a Hierarchical Clustering.pdf, benchmark2.html
>
>
> Hierarchical clustering algorithms are widely used and would make a nice addition to MLlib. Clustering algorithms are useful for determining relationships between clusters as well as offering faster assignment.
> Discussion on the dev list suggested the following possible approaches:
> * Top down, recursive application of KMeans
> * Reuse DecisionTree implementation with different objective function
> * Hierarchical SVD
> It was also suggested that support for distance metrics other than Euclidean such as negative dot or cosine are necessary.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)