[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14166593#comment-14166593
 ] 

Yu Ishikawa commented on SPARK-2429:
------------------------------------

Hi [~rnowling],

Thank you for your comments and advice.

{quote}
Ok, first off, let me make sure I understand what you're doing. You start with 
2 centers. You assign all the points. You then apply KMeans recursively to each 
cluster, splitting each center into 2 centers. Each instance of KMeans stops 
when the error is below a certain value or a fixed number of iterations have 
been run.
{quote}
You are right. The algorithm runs as you said.
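To make the recursion concrete, here is a minimal, illustrative Python sketch of that bisecting strategy (the actual SPARK-2429 work is in Scala on RDDs; function names and the stopping rule here are simplified placeholders):

```python
import numpy as np

def kmeans_2(points, n_iter=20, seed=0):
    """Plain 2-means with Lloyd iterations: returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), 2, replace=False)]
    for _ in range(n_iter):
        # Assign every point to its nearest of the 2 centers.
        d = np.linalg.norm(points[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        # Recompute each non-empty center as the mean of its points.
        for k in range(2):
            if (labels == k).any():
                centers[k] = points[labels == k].mean(axis=0)
    return centers, labels

def bisect(points, max_depth, min_size=2):
    """Recursively split each cluster into 2 until max_depth is reached,
    returning the leaf centers (a stand-in for the real stopping criteria)."""
    if max_depth == 0 or len(points) < min_size:
        return [points.mean(axis=0)]
    _, labels = kmeans_2(points)
    leaves = []
    for k in range(2):
        part = points[labels == k]
        if len(part) > 0:
            leaves.extend(bisect(part, max_depth - 1, min_size))
    return leaves
```

In the distributed version each `points[labels == k]` selection would be an RDD filter, which is where the conversion and caching questions below come in.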

{quote}
I think your analysis of the overall run time is good and probably what we 
expect. Can you break down the timing to see which parts are the most 
expensive? Maybe we can figure out where to optimize it.
{quote}
OK. I will measure the execution time of each part of the implementation.
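For the breakdown, something like the following instrumentation could work; the stage names here are hypothetical placeholders, not the actual code paths:

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(stage):
    """Accumulate wall-clock time per named stage."""
    start = time.perf_counter()
    yield
    timings[stage] = timings.get(stage, 0.0) + time.perf_counter() - start

# Stand-in workloads for the real training stages.
with timed("assign points"):
    sum(range(100000))
with timed("update centers"):
    sum(range(100000))

# Report the most expensive stages first.
for stage, sec in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f"{stage}: {sec:.4f}s")
```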

{quote}
1. It might be good to convert everything to Breeze vectors before you do any 
operations – you need to convert the same vectors over and over again. KMeans 
converts them at the beginning and converts the vectors for the centers back at 
the end.
{quote}
I agree with you. This problem has been troubling me. After training the model, 
the user will likely want to select the data belonging to one cluster, i.e. a 
subset of the whole input data. I can think of three approaches:

# Extract the centers together with their `RDD\[Vector\]` data for each cluster 
during training, as my current implementation does.
# Extract the centers together with their `RDD\[BV\[Double\]\]` data, and 
convert the data to `RDD\[Vector\]` only at the end.
Converting Breeze vectors back to Spark vectors is very slow, which is why we 
didn't implement it this way.
# Extract only the centers during training, not their data, and then apply the 
trained model to the input data with a `predict` method, as scikit-learn does, 
to extract each cluster's subset of the data.
This seems to be the best option. However, we would have to keep the 
`RDD\[BV\[Double\]\]` data of each cluster throughout the clustering. Because 
we would extract the `RDD\[Vector\]` data of each cluster only after training, 
I am worried that keeping the intermediate `RDD\[BV\[Double\]\]` data is 
wasteful, and I am unsure how to save the in-progress data elegantly.
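The third approach might look like this rough Python sketch (class and method names are illustrative, not the MLlib API): the model keeps only the centers, and per-cluster data is recovered afterwards with a `predict` pass.

```python
import numpy as np

class CentersOnlyModel:
    """Hypothetical model that stores only the trained centers."""

    def __init__(self, centers):
        self.centers = np.asarray(centers)

    def predict(self, points):
        """Index of the nearest center for every point."""
        d = np.linalg.norm(points[:, None] - self.centers[None], axis=2)
        return d.argmin(axis=1)

    def cluster_data(self, points, k):
        """Recover cluster k's subset of the input data after training,
        instead of carrying every cluster's data through the recursion."""
        return points[self.predict(points) == k]
```

In Spark this second pass would be a `map`/`filter` over the original `RDD\[Vector\]`, so no intermediate Breeze RDDs need to be retained.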

{quote}
2. Instead of passing the centers as part of the EuclideanClosestCenterFinder, 
look into using a broadcast variable. See the latest KMeans implementation. 
This could improve performance by 10%+.

3. You may want to look into using reduceByKey or similar RDD operations – they 
will enable parallel reductions which will be faster than a loop on the master.
{quote}
I will give both of these a try. Thanks!
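The reduceByKey pattern in suggestion 3 can be sketched in plain Python (a stand-in for the RDD operation, with illustrative data): the center update becomes keyed partial reductions, which Spark can combine per partition in parallel, rather than a loop over all points on the master.

```python
def reduce_by_key(pairs, combine):
    """Minimal sequential stand-in for RDD.reduceByKey."""
    acc = {}
    for key, value in pairs:
        acc[key] = combine(acc[key], value) if key in acc else value
    return acc

# (cluster_id, (point_sum, count)) pairs, as emitted by the assignment step.
assignments = [(0, ((1.0, 2.0), 1)), (1, ((4.0, 4.0), 1)), (0, ((3.0, 0.0), 1))]

def merge(a, b):
    """Associative combiner: add component sums and counts."""
    (sa, ca), (sb, cb) = a, b
    return (tuple(x + y for x, y in zip(sa, sb)), ca + cb)

sums = reduce_by_key(assignments, merge)
# New center per cluster = component sum / count.
centers = {k: tuple(x / c for x in s) for k, (s, c) in sums.items()}
```

Because `merge` is associative and commutative, the same combiner works unchanged when the pairs are reduced in parallel across partitions.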

> Hierarchical Implementation of KMeans
> -------------------------------------
>
>                 Key: SPARK-2429
>                 URL: https://issues.apache.org/jira/browse/SPARK-2429
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: RJ Nowling
>            Assignee: Yu Ishikawa
>            Priority: Minor
>         Attachments: The Result of Benchmarking a Hierarchical Clustering.pdf
>
>
> Hierarchical clustering algorithms are widely used and would make a nice 
> addition to MLlib.  Clustering algorithms are useful for determining 
> relationships between clusters as well as offering faster assignment. 
> Discussion on the dev list suggested the following possible approaches:
> * Top down, recursive application of KMeans
> * Reuse DecisionTree implementation with different objective function
> * Hierarchical SVD
> It was also suggested that support for distance metrics other than Euclidean 
> such as negative dot or cosine are necessary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
