[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213942#comment-14213942
 ] 

Jun Yang commented on SPARK-2429:
---------------------------------

Hi Yu Ishikawa 

Thanks for your wonderful hierarchical implementation of KMeans, which meets 
one of our project requirements :)

In our project, we initially used an MPI-based HAC implementation to do 
bottom-up agglomerative hierarchical clustering. Since we want to migrate the 
entire back-end pipeline to Spark, we have been looking for a similar 
hierarchical clustering implementation on Spark; otherwise we would need to 
write it ourselves. 

From a functionality perspective, your implementation looks pretty good (I have 
already read through your code), but I still have several questions regarding 
performance and scalability:
1. In your implementation, each divisive step performs a "copy" operation to 
distribute the data nodes in the parent cluster tree to the split children 
cluster trees. When the document size is large, I think this copy cost is 
non-negligible, right?
A potential optimization is to keep the entire document data cached and, in 
each divisive step, record only the indices of the documents in the 
ClusterTree object, so the cost could be lowered quite a lot.

Does this idea make sense?

2. In your test code, the number of clusters is not very large (only about 
100). Have you ever tested it with a large number of clusters and a large 
document corpus, e.g., 10000 clusters with 2000000 documents? How does it 
perform in this kind of use case?
In production environments, this kind of use case is quite typical.
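
To make the idea in point 1 concrete, here is a minimal single-machine sketch 
in plain Python. It is only an illustration under stated assumptions, not the 
code from the patch: ClusterNode and split_by_index are hypothetical names, the 
split centers are passed in rather than computed by KMeans, and the in-memory 
list stands in for what would be a cached data set in Spark.

```python
# Hypothetical sketch: ClusterNode and split_by_index are illustrative names,
# not part of the SPARK-2429 patch or of MLlib.
from dataclasses import dataclass, field
from typing import List, Sequence

Vector = Sequence[float]


@dataclass
class ClusterNode:
    """A tree node that stores document indices instead of document vectors."""
    indices: List[int]
    children: List["ClusterNode"] = field(default_factory=list)


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def split_by_index(node: ClusterNode,
                   docs: Sequence[Vector],
                   centers: Sequence[Vector]) -> List[ClusterNode]:
    """Assign each index in the node to its nearest center.

    Only integers are moved into the children; the vectors in `docs`
    are read in place and never copied.
    """
    buckets: List[List[int]] = [[] for _ in centers]
    for i in node.indices:
        nearest = min(range(len(centers)),
                      key=lambda c: squared_distance(docs[i], centers[c]))
        buckets[nearest].append(i)
    # Drop empty buckets so every child holds at least one document.
    node.children = [ClusterNode(b) for b in buckets if b]
    return node.children


# The corpus is stored once; the tree only ever refers to it by index.
docs = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9)]
root = ClusterNode(indices=list(range(len(docs))))
# Assumption: the two centers are given here; a real divisive step would
# obtain them by running KMeans on the documents of the node.
children = split_by_index(root, docs, centers=[(0.0, 0.0), (5.0, 5.0)])
print([child.indices for child in children])  # [[0, 1], [2, 3]]
```

In this sketch each split moves one integer per document rather than the 
document itself. Whether the same saving holds on Spark depends on how the 
indices are joined back to the cached data, so it would need to be measured.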

Looking forward to your reply. 

Thanks 


> Hierarchical Implementation of KMeans
> -------------------------------------
>
>                 Key: SPARK-2429
>                 URL: https://issues.apache.org/jira/browse/SPARK-2429
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: RJ Nowling
>            Assignee: Yu Ishikawa
>            Priority: Minor
>         Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The 
> Result of Benchmarking a Hierarchical Clustering.pdf, 
> benchmark-result.2014-10-29.html, benchmark2.html
>
>
> Hierarchical clustering algorithms are widely used and would make a nice 
> addition to MLlib.  Clustering algorithms are useful for determining 
> relationships between clusters as well as offering faster assignment. 
> Discussion on the dev list suggested the following possible approaches:
> * Top down, recursive application of KMeans
> * Reuse DecisionTree implementation with different objective function
> * Hierarchical SVD
> It was also suggested that support for distance metrics other than Euclidean, 
> such as negative dot or cosine, is necessary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
