[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Ishikawa updated SPARK-2429:
-------------------------------
    Attachment: benchmark2.html

Hi [~rnowling],

I improved the performance of my implementation. Could you check it?
It now runs about 6 times faster than the previous version. For example, with 
1,000,000 input rows, 100 clusters, and 100 dimensions, the training time 
dropped from 5293 seconds to 871 seconds.

- Source Code
https://github.com/yu-iskw/hierarchical-clustering-with-spark/blob/master/src%2Fmain%2Fscala%2Forg.apache.spark.mllib.clustering%2FHierarchicalClustering.scala
- Test Code
https://github.com/yu-iskw/hierarchical-clustering-with-spark/blob/master/src%2Ftest%2Fscala%2Forg.apache.spark.mllib.clustering%2FHierarchicalClusteringSuite.scala

After inspecting my previous implementation, I found two big bottlenecks:
# I didn't cache the data of each clustering step effectively
# Generating the initial centers of each step was very slow

h3. Changes to the Source Code

- Cache the sub data of each cluster
- Modify `takeInitCenters`
- Use RDD\[BV\[Double\]\], not RDD\[Vector\]
- Make a minor change to the algorithm
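
The effect of the first change can be illustrated without Spark. The sketch below is a plain-Python analogue, not the code from the repository: `scan_all` stands in for an uncached dataset whose filter is recomputed on every pass, `cached_subset` stands in for a cached per-cluster subset, and all names and the toy data are hypothetical.

```python
# Plain-Python analogue of caching a cluster's sub data (not Spark code).
# All names here are hypothetical and only illustrate the idea.

rows_scanned = 0  # counts how many rows are read from the full dataset


def scan_all(data):
    """Yield every row and count the reads, like recomputing an uncached dataset."""
    global rows_scanned
    for row in data:
        rows_scanned += 1
        yield row


def mean(values):
    return sum(values) / len(values)


def refine_without_cache(data, in_cluster, iterations):
    """Re-filter the full dataset on every iteration."""
    center = 0.0
    for _ in range(iterations):
        members = [x for x in scan_all(data) if in_cluster(x)]
        center = mean(members)
    return center


def refine_with_cache(data, in_cluster, iterations):
    """Filter once, keep the subset in memory, then iterate on it."""
    cached_subset = [x for x in scan_all(data) if in_cluster(x)]
    center = 0.0
    for _ in range(iterations):
        center = mean(cached_subset)
    return center


def in_cluster(x):
    # Toy membership rule: the cluster holds 10% of the rows.
    return x < 100.0


data = [float(i) for i in range(1000)]

rows_scanned = 0
center_a = refine_without_cache(data, in_cluster, iterations=20)
scans_without_cache = rows_scanned

rows_scanned = 0
center_b = refine_with_cache(data, in_cluster, iterations=20)
scans_with_cache = rows_scanned
```

Both functions return the same center, but the uncached variant reads the full dataset once per iteration while the cached variant reads it once in total. In Spark, the corresponding step would be to call `cache()` on the filtered RDD before iterating over it; whether the repository does exactly this is not shown in this message.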

h3. Changes to the Benchmark Report

- Add a test of the model accuracy

I will send you an explanation of the algorithm later. Could you give me a little time?
Best,

> Hierarchical Implementation of KMeans
> -------------------------------------
>
>                 Key: SPARK-2429
>                 URL: https://issues.apache.org/jira/browse/SPARK-2429
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: RJ Nowling
>            Assignee: Yu Ishikawa
>            Priority: Minor
>         Attachments: The Result of Benchmarking a Hierarchical 
> Clustering.pdf, benchmark2.html
>
>
> Hierarchical clustering algorithms are widely used and would make a nice 
> addition to MLlib.  Clustering algorithms are useful for determining 
> relationships between clusters as well as offering faster assignment. 
> Discussion on the dev list suggested the following possible approaches:
> * Top down, recursive application of KMeans
> * Reuse DecisionTree implementation with different objective function
> * Hierarchical SVD
> It was also suggested that support for distance metrics other than Euclidean, 
> such as negative dot or cosine, is necessary.
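
The first approach listed in the issue, top-down recursive application of KMeans, can be sketched roughly as follows. This is a minimal illustration in plain Python on in-memory tuples, assuming Euclidean distance and a two-way split at each level. It is not MLlib code and not the implementation attached to this issue, and every name in it (`two_means`, `build_tree`, `leaves`) is hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    dim = len(points[0])
    return tuple(sum(p[i] for p in points) / len(points) for i in range(dim))


def two_means(points, rng, iterations=10):
    """Split points into two groups with plain k-means (k = 2)."""
    centers = rng.sample(points, 2)
    groups = [points, []]
    for _ in range(iterations):
        groups = [[], []]
        for p in points:
            nearer = 0 if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]) else 1
            groups[nearer].append(p)
        if not groups[0] or not groups[1]:
            break  # degenerate split, stop early
        centers = [centroid(g) for g in groups]
    return groups


def build_tree(points, rng, min_size=2, max_depth=3, depth=0):
    """Recursively split a cluster in two until it is small or deep enough."""
    node = {"center": centroid(points), "size": len(points), "children": []}
    if len(points) < 2 * min_size or depth >= max_depth:
        return node
    left, right = two_means(points, rng)
    if len(left) >= min_size and len(right) >= min_size:
        node["children"] = [
            build_tree(group, rng, min_size, max_depth, depth + 1)
            for group in (left, right)
        ]
    return node


def leaves(node):
    """Collect the leaf clusters of the tree."""
    if not node["children"]:
        return [node]
    return [leaf for child in node["children"] for leaf in leaves(child)]


# Toy data: two well-separated 3x3 grids of points.
blob_a = [(i * 0.1, j * 0.1) for i in range(3) for j in range(3)]
blob_b = [(10 + i * 0.1, 10 + j * 0.1) for i in range(3) for j in range(3)]
tree = build_tree(blob_a + blob_b, random.Random(0), min_size=3, max_depth=1)
leaf_sizes = sorted(leaf["size"] for leaf in leaves(tree))
```

With `max_depth=1` the root is split once, so the toy data ends up in two leaf clusters, one per grid. Swapping `sq_dist` for another function would be one way to experiment with the other distance metrics mentioned above.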



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

