[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179890#comment-14179890
 ] 

RJ Nowling commented on SPARK-2429:
-----------------------------------

A 6x performance improvement is a great improvement!

Can you add a breakdown of the timings for each part of the algorithm (e.g., 
like you did to find out which parts were slowest)?  You don't need to do a 
sweep over multiple data sizes or numbers of data points -- just pick a 
representative number of data points and rows.
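
As a rough illustration of the kind of per-stage breakdown meant here, a 
minimal sketch in plain Python follows.  It is not taken from the patch: the 
helper names (timed, report) and the stage labels are hypothetical, and on 
Spark each stage would need to force evaluation (e.g. with an action) inside 
the timed region for the numbers to mean anything.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock seconds per stage label (labels are hypothetical).
timings = defaultdict(float)

@contextmanager
def timed(stage):
    """Add the elapsed wall-clock time of the enclosed block to `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] += time.perf_counter() - start

def report():
    """Return (stage, seconds, percent of total) rows, slowest stage first."""
    total = sum(timings.values())
    rows = []
    for stage, secs in sorted(timings.items(), key=lambda kv: -kv[1]):
        share = 100.0 * secs / total if total > 0 else 0.0
        rows.append((stage, secs, share))
    return rows

if __name__ == "__main__":
    with timed("split clusters"):      # placeholder work, not the real algorithm
        sum(i * i for i in range(200000))
    with timed("assign points"):
        sorted(range(200000), reverse=True)
    for stage, secs, share in report():
        print(f"{stage:<16} {secs:8.4f} s  {share:5.1f} %")
```

Wrapping each phase of one run at a single representative data size in 
timed(...) and posting the report() rows would be enough.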

Have you compared the performance of the hierarchical KMeans vs. the KMeans 
implemented in MLlib?  I expect the hierarchical version to be slower to 
cluster, but assignment should be faster (O(log k) vs. O(k) distance 
computations per point for k clusters).  This improvement in assignment speed 
is the motivation for including the hierarchical KMeans in Spark.
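
To make the O(log k) vs O(k) point concrete, here is a small sketch in plain 
Python that counts distance evaluations for flat assignment and for a greedy 
descent through a balanced binary tree of centers.  It is an illustration 
under assumptions, not the implementation under review: Node, build_balanced, 
assign_flat and assign_tree are hypothetical names, and the toy tree is built 
by sorting and halving the centers rather than by running KMeans.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign_flat(point, centers):
    """Compare against every center: k distance evaluations."""
    best = min(range(len(centers)), key=lambda i: sq_dist(point, centers[i]))
    return best, len(centers)

class Node:
    """Tree node; a leaf carries the index of the cluster it represents."""
    def __init__(self, center, children=(), leaf_id=None):
        self.center = center
        self.children = list(children)
        self.leaf_id = leaf_id

def build_balanced(centers, ids=None):
    """Toy builder (assumption): sort the centers and split them in half
    recursively.  Each internal node's center is the mean of its leaves."""
    if ids is None:
        ids = sorted(range(len(centers)), key=lambda i: centers[i])
    dim = len(centers[ids[0]])
    mean = [sum(centers[i][d] for i in ids) / len(ids) for d in range(dim)]
    if len(ids) == 1:
        return Node(mean, leaf_id=ids[0])
    mid = len(ids) // 2
    return Node(mean, [build_balanced(centers, ids[:mid]),
                       build_balanced(centers, ids[mid:])])

def assign_tree(point, root):
    """Greedy descent: at each level, move to the closest child."""
    node, evals = root, 0
    while node.children:
        evals += len(node.children)
        node = min(node.children, key=lambda c: sq_dist(point, c.center))
    return node.leaf_id, evals

if __name__ == "__main__":
    centers = [[float(i)] for i in range(8)]   # 8 one-dimensional centers
    root = build_balanced(centers)
    print(assign_flat([5.2], centers))  # (cluster index, distance evaluations)
    print(assign_tree([5.2], root))
```

With these 8 centers, the flat search evaluates 8 distances and the tree 
descent evaluates 6 (two per level over three levels), and both pick cluster 
5.  In general the greedy descent is only approximate: it can land in a 
different leaf than the exhaustive search when a point lies near a split 
boundary.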

Thanks!

> Hierarchical Implementation of KMeans
> -------------------------------------
>
>                 Key: SPARK-2429
>                 URL: https://issues.apache.org/jira/browse/SPARK-2429
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: RJ Nowling
>            Assignee: Yu Ishikawa
>            Priority: Minor
>         Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The 
> Result of Benchmarking a Hierarchical Clustering.pdf, benchmark2.html
>
>
> Hierarchical clustering algorithms are widely used and would make a nice 
> addition to MLlib.  They are useful for determining relationships between 
> clusters as well as offering faster assignment. 
> Discussion on the dev list suggested the following possible approaches:
> * Top down, recursive application of KMeans
> * Reuse DecisionTree implementation with different objective function
> * Hierarchical SVD
> It was also suggested that support for distance metrics other than 
> Euclidean, such as negative dot product or cosine, is necessary.



