[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans

Yu Ishikawa (JIRA) Wed, 22 Oct 2014 07:47:49 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179974#comment-14179974
 ]


Yu Ishikawa commented on SPARK-2429:
------------------------------------

{quote}
Can you add a breakdown of the timings for each part of the algorithm? (e.g, 
like you did to find out which parts were slowest?) You don't need to do a 
sweep over multiple data sizes or number of data points – just pick a 
representative number of data point and rows.
{quote}

Yes, I can. So far I have only embedded debug code in each part of the 
algorithm; it does not emit log messages yet. I can add logging of the 
execution time of each part. Or would you prefer to get the per-part times as 
variables, like `HierarchicalClusteringModel.trainTime`?
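
To make the two options concrete, here is a minimal sketch of a per-step timer 
whose totals can be either logged or read back as values. `StepTimer` and the 
step names are hypothetical and are not part of MLlib or of the current patch.

```scala
import scala.collection.mutable

// Hypothetical helper (an assumption, not MLlib API): accumulates wall-clock
// time per named step so it can be logged or exposed as a value afterwards.
class StepTimer {
  private val elapsedNanos = mutable.LinkedHashMap.empty[String, Long]

  // Runs `body`, adds its elapsed time to the total for `step`,
  // and returns the result of `body` unchanged.
  def time[A](step: String)(body: => A): A = {
    val start = System.nanoTime()
    try body
    finally {
      val spent = System.nanoTime() - start
      elapsedNanos(step) = elapsedNanos.getOrElse(step, 0L) + spent
    }
  }

  // Snapshot of the totals in milliseconds, in the order steps were first seen.
  def millis: Seq[(String, Double)] =
    elapsedNanos.toSeq.map { case (step, ns) => (step, ns / 1e6) }
}

object StepTimerExample {
  def main(args: Array[String]): Unit = {
    val timer = new StepTimer
    // Placeholder work standing in for the real parts of the algorithm.
    val data = timer.time("initialize") { Array.tabulate(1000)(_.toDouble) }
    val total = timer.time("sum") { data.sum }
    println(s"total = $total")
    timer.millis.foreach { case (step, ms) => println(f"$step%-12s $ms%.3f ms") }
  }
}
```

With this shape, a log line per step and a model field such as `trainTime` 
could both be derived from the same `millis` snapshot.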

{quote}
Have you compared the performance of the hierarchical KMeans vs KMeans 
implemented in MLLib? I expect that the hierarchical will be slower to cluster 
but the assignment should be faster (O(log k) vs O(k)). This improvement in 
assignment speed is the motivation for including the hierarchical KMeans in 
Spark.
{quote}

No, I haven't yet. I agree that hierarchical clustering will be slower to 
train than k-means. However, the assignment in 
`HierarchicalClusteringModel.predict` is currently only as fast as that of 
k-means, not faster. We should improve the `ClusterTree` data structure, for 
example with a B-tree-like layout, so that it can be searched by its index. Do 
you have any good ideas for improving it?
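
For reference, a minimal sketch of the tree-descent assignment that the 
O(log k) estimate assumes. The `Node`, `Leaf` and `Branch` types are 
hypothetical stand-ins and do not reflect the actual `ClusterTree` fields. The 
descent costs one pair of distance computations per level, so it is 
logarithmic in k only when the tree is reasonably balanced, and a greedy 
descent is not guaranteed to reach the globally nearest leaf.

```scala
// Hypothetical stand-in for a binary cluster tree; names and fields are
// assumptions made for this sketch only.
sealed trait Node { def center: Array[Double] }
final case class Leaf(center: Array[Double], clusterId: Int) extends Node
final case class Branch(center: Array[Double], left: Node, right: Node) extends Node

object TreeAssign {
  // Squared Euclidean distance; the square root is not needed for comparisons.
  def squaredDistance(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "dimension mismatch")
    var sum = 0.0
    var i = 0
    while (i < a.length) {
      val d = a(i) - b(i)
      sum += d * d
      i += 1
    }
    sum
  }

  // Greedy descent: at each branch, follow the child whose center is nearer
  // to the point, until a leaf is reached.
  @annotation.tailrec
  def predict(node: Node, point: Array[Double]): Int = node match {
    case Leaf(_, id) => id
    case Branch(_, l, r) =>
      val goLeft = squaredDistance(point, l.center) <= squaredDistance(point, r.center)
      predict(if (goLeft) l else r, point)
  }
}
```

A flat scan over all k leaf centers is the O(k) baseline this would be 
compared against.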


There is something I would like to discuss with you. I know we should support 
distance metrics other than Euclidean, such as cosine distance, and I want to. 
However, I don't yet know the best way to support distance functions in MLlib 
as a common library. So I have a suggestion about releasing the algorithm: 
first, we merge it with only the Euclidean distance metric, and then we add 
support for other distance metrics in a separate issue. What do you think?
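
One possible shape for such a follow-up, as a sketch only: a small trait that 
the clustering code takes as a parameter, with Euclidean as the single 
implementation in the first release. `DistanceMeasure`, `EuclideanDistance`, 
`CosineDistance` and `nearest` are hypothetical names chosen for this example; 
they are not an existing MLlib API.

```scala
// Hypothetical abstraction for pluggable distances (an assumption for
// illustration, not an agreed MLlib design).
trait DistanceMeasure extends Serializable {
  def apply(a: Array[Double], b: Array[Double]): Double
}

object EuclideanDistance extends DistanceMeasure {
  def apply(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "dimension mismatch")
    math.sqrt(a.indices.map { i => val d = a(i) - b(i); d * d }.sum)
  }
}

object CosineDistance extends DistanceMeasure {
  private def dot(a: Array[Double], b: Array[Double]): Double =
    a.indices.map(i => a(i) * b(i)).sum

  // 1 - cos(a, b); undefined for zero vectors, so those are rejected.
  def apply(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "dimension mismatch")
    val norms = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    require(norms > 0.0, "cosine distance is undefined for zero vectors")
    1.0 - dot(a, b) / norms
  }
}

object Assign {
  // Index of the nearest center under the given distance measure.
  def nearest(point: Array[Double], centers: Seq[Array[Double]],
              distance: DistanceMeasure = EuclideanDistance): Int = {
    require(centers.nonEmpty, "at least one center is required")
    centers.indices.minBy(i => distance(point, centers(i)))
  }
}
```

Defaulting the parameter to Euclidean would let the first release ship 
unchanged in behavior while leaving room for other metrics later.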

Thanks

> Hierarchical Implementation of KMeans
> -------------------------------------
>
>                 Key: SPARK-2429
>                 URL: https://issues.apache.org/jira/browse/SPARK-2429
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: RJ Nowling
>            Assignee: Yu Ishikawa
>            Priority: Minor
>         Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The 
> Result of Benchmarking a Hierarchical Clustering.pdf, benchmark2.html
>
>
> Hierarchical clustering algorithms are widely used and would make a nice 
> addition to MLlib.  Clustering algorithms are useful for determining 
> relationships between clusters as well as offering faster assignment. 
> Discussion on the dev list suggested the following possible approaches:
> * Top down, recursive application of KMeans
> * Reuse DecisionTree implementation with different objective function
> * Hierarchical SVD
> It was also suggested that support for distance metrics other than Euclidean 
> such as negative dot or cosine are necessary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
