[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14227352#comment-14227352 ]

Yu Ishikawa commented on SPARK-2429:
------------------------------------

Hi [~yangjunpro], 

{quote}
1. In your implementation, at each divisive step there is a "copy" 
operation to distribute the data points in the parent cluster tree to the 
split child cluster trees. When the document count is large, I think this 
copy cost is non-negligible, right?
{quote}

Exactly. The cached data is roughly twice the size of the original input. For 
example, if the input is 10 GB, the Spark cluster holds about 20 GB of cached 
RDDs throughout the algorithm. The reason I cache the data points at each 
division is running time: without caching, this algorithm is very slow 
because each split would otherwise recompute its input from scratch.
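The trade-off above can be sketched roughly as follows (illustrative only; `divide` and its body are hypothetical, not the actual patch, though `KMeans.train` is the MLlib API):

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Hypothetical sketch of one divisive step: split a parent cluster's
// points into two child RDDs with 2-means, caching the children so the
// iterative k-means passes do not recompute the parent's lineage.
// Holding both children cached alongside the original input is what
// roughly doubles the memory footprint.
def divide(parent: RDD[Vector]): (RDD[Vector], RDD[Vector]) = {
  val model = KMeans.train(parent, 2, 20)
  val left  = parent.filter(v => model.predict(v) == 0).cache()
  val right = parent.filter(v => model.predict(v) == 1).cache()
  // Materialize the children before releasing the parent's cache.
  left.count(); right.count()
  parent.unpersist()
  (left, right)
}
```

Unpersisting the parent once the children are materialized keeps only the active frontier of the tree cached, rather than every level.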

{quote}
2. In your test code, the cluster size is not quite large (only about 100). 
Have you ever tested it with a big cluster size and a big document corpus, 
e.g., 10000 clusters with 2000000 documents? What is the performance behavior 
in this kind of use case?
{quote}

The test code deals with small data, as you said. I think the data size in 
unit tests should be kept small in order to reduce the test time. Of course, 
I am willing to adapt this implementation to the large inputs and large 
cluster counts we want. Although I have not yet checked the performance of 
this implementation with a large number of clusters, such as 10000, the 
elapsed time could be long. I will benchmark it under those conditions. Or, 
if possible, could you check the performance?

Thanks


> Hierarchical Implementation of KMeans
> -------------------------------------
>
>                 Key: SPARK-2429
>                 URL: https://issues.apache.org/jira/browse/SPARK-2429
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: RJ Nowling
>            Assignee: Yu Ishikawa
>            Priority: Minor
>         Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The 
> Result of Benchmarking a Hierarchical Clustering.pdf, 
> benchmark-result.2014-10-29.html, benchmark2.html
>
>
> Hierarchical clustering algorithms are widely used and would make a nice 
> addition to MLlib.  Clustering algorithms are useful for determining 
> relationships between clusters as well as offering faster assignment. 
> Discussion on the dev list suggested the following possible approaches:
> * Top down, recursive application of KMeans
> * Reuse DecisionTree implementation with different objective function
> * Hierarchical SVD
> It was also suggested that support for distance metrics other than Euclidean, 
> such as negative dot product or cosine, is necessary.
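
The first approach listed above (top-down, recursive application of KMeans) can be sketched as follows. This is a minimal illustration, not the proposed implementation; `bisect` and `depth` are hypothetical names, while `KMeans.train` is the existing MLlib entry point:

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Hypothetical sketch: recursively bisect the data with 2-means until a
// maximum depth is reached, yielding the leaf clusters of the hierarchy.
def bisect(data: RDD[Vector], depth: Int): Seq[RDD[Vector]] = {
  if (depth == 0 || data.count() < 2) {
    Seq(data)
  } else {
    val model = KMeans.train(data, 2, 20)
    val children = (0 until 2).map { i =>
      data.filter(v => model.predict(v) == i)
    }
    children.flatMap(child => bisect(child, depth - 1))
  }
}
```

A depth of d yields up to 2^d leaf clusters, so reaching 10000 clusters requires about 14 levels of recursion.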



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
