[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357497#comment-14357497
 ] 

Jeremy Freeman edited comment on SPARK-2429 at 3/11/15 8:10 PM:
----------------------------------------------------------------

Thanks for the update and contribution [~yuu.ishik...@gmail.com]! I think I 
agree with [~josephkb] that it is worth bringing this into MLlib, as the 
algorithm itself will translate to future uses, and many groups (including 
ours!) will find it useful now.

It might be worth adding to spark-packages, especially if we expect the review 
to take a while. Packages seem especially useful as a way to provide easy access 
to experimental pieces of functionality for testing. But I'd probably prioritize 
just reviewing the patch.

Also agree with the others that we should start a new PR with the new 
algorithm; 1000x faster is a lot! It is worth incorporating some of the comments 
from the old PR if you haven't already, where they are still relevant to the new 
version.

I'd be happy to go through the new PR as I'm quite familiar with the problem / 
algorithm, but it would help if you could say a little more about what you did 
so differently here, to help guide me as I look at the code.


was (Author: freeman-lab):
Thanks for the update and contribution [~yuu.ishik...@gmail.com]! I think I 
agree with [~josephkb] that it is worth bringing this into MLlib, as the 
algorithm itself will translate to future uses, and many groups (including 
ours!) will find it useful now.

It might be worth adding to spark-packages, especially if we expect the review 
to take a while. Packages seem especially useful as a way to provide easy access 
to new pieces of functionality for testing. But I'd probably prioritize just 
reviewing the patch.

Also agree with the others that we should start a new PR with the new 
algorithm; 1000x faster is a lot! It is worth incorporating some of the comments 
from the old PR if you haven't already, where they are still relevant to the new 
version.

I'd be happy to go through the new PR as I'm quite familiar with the problem / 
algorithm, but it would help if you could say a little more about what you did 
so differently here, to help guide me as I look at the code.

> Hierarchical Implementation of KMeans
> -------------------------------------
>
>                 Key: SPARK-2429
>                 URL: https://issues.apache.org/jira/browse/SPARK-2429
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: RJ Nowling
>            Assignee: Yu Ishikawa
>            Priority: Minor
>              Labels: clustering
>         Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The 
> Result of Benchmarking a Hierarchical Clustering.pdf, 
> benchmark-result.2014-10-29.html, benchmark2.html
>
>
> Hierarchical clustering algorithms are widely used and would make a nice 
> addition to MLlib.  Hierarchical clustering is useful for determining 
> relationships between clusters as well as offering faster assignment. 
> Discussion on the dev list suggested the following possible approaches:
> * Top down, recursive application of KMeans
> * Reuse DecisionTree implementation with different objective function
> * Hierarchical SVD
> It was also suggested that support for distance metrics other than Euclidean, 
> such as negative dot product or cosine, is necessary.
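
For readers unfamiliar with the first approach listed in the description 
(top down, recursive application of KMeans), here is a minimal single-machine 
sketch in plain Python. It is only an illustration under stated assumptions: it 
is not the algorithm in the patch under review and not an MLlib API, and all 
names (two_means, bisect, assign, min_size, max_depth) are made up for this 
example. It uses squared Euclidean distance only.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def two_means(points, iters=20, seed=0):
    """Plain Lloyd iterations with k=2; returns the two groups of points."""
    rng = random.Random(seed)
    centers = rng.sample(points, 2)  # hypothetical init: two random points
    left, right = [], []
    for _ in range(iters):
        left, right = [], []
        for p in points:
            nearer_left = sq_dist(p, centers[0]) <= sq_dist(p, centers[1])
            (left if nearer_left else right).append(p)
        if not left or not right:
            break  # degenerate split; the caller decides what to do
        centers = [mean(left), mean(right)]
    return left, right


def bisect(points, min_size=2, depth=0, max_depth=8):
    """Recursively split points in two, returning a nested dict tree."""
    node = {"center": mean(points), "size": len(points), "children": []}
    if len(points) < 2 * min_size or depth >= max_depth:
        return node
    left, right = two_means(points)
    if len(left) < min_size or len(right) < min_size:
        return node  # keep this node as a leaf rather than make tiny clusters
    node["children"] = [
        bisect(left, min_size, depth + 1, max_depth),
        bisect(right, min_size, depth + 1, max_depth),
    ]
    return node


def assign(tree, point):
    """Walk down the tree, following the nearer child at each level."""
    node = tree
    while node["children"]:
        node = min(node["children"], key=lambda c: sq_dist(point, c["center"]))
    return node["center"]
```

The assign function shows where the "faster assignment" mentioned in the 
description can come from: a point is compared against two centers per level of 
the tree instead of against every leaf center. Supporting other distance 
metrics would need more than swapping sq_dist, since the mean-based center 
update here assumes Euclidean distance.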


