[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14165480#comment-14165480
 ] 

RJ Nowling commented on SPARK-2429:
-----------------------------------

Great work, Yu!

Ok, first off, let me make sure I understand what you're doing.  You start with 
2 centers.  You assign all the points.  You then apply KMeans recursively to 
each cluster, splitting each center into 2 centers.  Each instance of KMeans 
stops when the error is below a certain value or a fixed number of iterations 
have been run.

I think your analysis of the overall run time is good and probably what we 
expect.  Can you break down the timing to see which parts are the most 
expensive?  Maybe we can figure out where to optimize it.

A few thoughts on optimization:
1. It might be good to convert everything to Breeze vectors before you do any 
operations -- you need to convert the same vectors over and over again.  KMeans 
converts them at the beginning and converts the vectors for the centers back at 
the end.

2. Instead of passing the centers as part of the EuclideanClosestCenterFinder, 
look into using a broadcast variable.  See the latest KMeans implementation. 
This could improve performance by 10%+.

3. You may want to look into using reduceByKey or similar RDD operations -- 
they will enable parallel reductions which will be faster than a loop on the 
master.

If you look at the JIRAs and PRs, there is some recent work to speed up KMeans 
-- maybe some of that is applicable?

I'll probably have more questions -- it's a good way of helping me understand 
what you're doing :)

> Hierarchical Implementation of KMeans
> -------------------------------------
>
>                 Key: SPARK-2429
>                 URL: https://issues.apache.org/jira/browse/SPARK-2429
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: RJ Nowling
>            Assignee: Yu Ishikawa
>            Priority: Minor
>         Attachments: The Result of Benchmarking a Hierarchical Clustering.pdf
>
>
> Hierarchical clustering algorithms are widely used and would make a nice 
> addition to MLlib.  Clustering algorithms are useful for determining 
> relationships between clusters as well as offering faster assignment. 
> Discussion on the dev list suggested the following possible approaches:
> * Top down, recursive application of KMeans
> * Reuse DecisionTree implementation with different objective function
> * Hierarchical SVD
> It was also suggested that support for distance metrics other than Euclidean 
> such as negative dot or cosine are necessary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to