[jira] [Comment Edited] (SPARK-2429) Hierarchical Implementation of KMeans

Yu Ishikawa (JIRA) Thu, 30 Oct 2014 08:03:12 -0700

    [ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14190166#comment-14190166 ]


Yu Ishikawa edited comment on SPARK-2429 at 10/30/14 3:02 PM:
--------------------------------------------------------------

I compared the training and prediction times of the hierarchical clustering 
with those of KMeans.
In theory, the computational complexity of cluster assignment in hierarchical 
clustering is lower than that of KMeans.
However, in this experiment both the training time and the prediction time of 
the hierarchical clustering were longer than those of KMeans.
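
To illustrate the theoretical claim only, here is a minimal Python sketch that counts distance evaluations per assignment. It assumes a balanced binary tree over the cluster centers; that structure, and the names build_tree, assign_flat, and assign_tree, are hypothetical and are not taken from the linked implementation. The sketch says nothing about measured wall-clock time.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


class Node:
    """Tree node; a leaf (no children) represents one final cluster center."""

    def __init__(self, center, children=()):
        self.center = center
        self.children = list(children)


def build_tree(centers):
    """Hypothetical helper: halve the list of centers recursively.

    This is NOT a clustering algorithm; it only produces a balanced
    binary tree so that assignment costs can be compared.
    """
    if len(centers) == 1:
        return Node(centers[0])
    mid = len(centers) // 2
    children = [build_tree(centers[:mid]), build_tree(centers[mid:])]
    return Node(mean(centers), children)


def assign_flat(point, centers):
    """Flat assignment: one distance evaluation per center."""
    best = min(range(len(centers)), key=lambda i: sq_dist(point, centers[i]))
    return best, len(centers)


def assign_tree(point, root):
    """Tree assignment: descend to the closer child at every level."""
    node, evals = root, 0
    while node.children:
        evals += len(node.children)
        node = min(node.children, key=lambda c: sq_dist(point, c.center))
    return node.center, evals


# 50 one-dimensional centers, matching numClusters = 50 in the experiment.
centers = [[float(i)] for i in range(50)]
root = build_tree(centers)
print(assign_flat([17.2], centers))
print(assign_tree([17.2], root))
```

With k centers, the flat search evaluates k distances, while the descent evaluates two per level, so at most 2 * ceil(log2(k)) in a balanced binary tree. The greedy descent is not guaranteed to reach the globally nearest leaf for every point.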

I used the program at the URL below for this experiment.
https://github.com/yu-iskw/hierarchical-clustering-with-spark/blob/37488e306d583d0e1743bff432165e8c1bf4465e/src/main/scala/CompareWithKMeansApp.scala

h3. Spark Cluster Specification

I ran it on EC2 with the following specification.

- Master Instance Type: r3.large
- Slave Instance Type: r3.8xlarge
-- Cores: 32
-- Memory: 244GB
- # of Slaves: 5
-- Total Cores: 160
-- Total Memory: 1220GB

h3. The Performance Result

{noformat}
{"maxCores" : "160", "numClusters" : "50", "dimension" : "500", "rows" : "1000000", "numPartitions" : "160"}
KMeans Training Elapsed Time: 28.179 [sec]
KMeans Predicting Elapsed Time: 0.011 [sec]
Hierarchical Training Elapsed Time: 46.539 [sec]
Hierarchical Predicting Elapsed Time: 0.3076923076923077 [sec]

{"maxCores" : "160", "numClusters" : "50", "dimension" : "500", "rows" : "5000000", "numPartitions" : "160"}
KMeans Training Elapsed Time: 55.187 [sec]
KMeans Predicting Elapsed Time: 0.008 [sec]
Hierarchical Training Elapsed Time: 210.238 [sec]
Hierarchical Predicting Elapsed Time: 0.3906093906093906 [sec]
{noformat}
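
To make the gap easier to read, the following short Python sketch computes how many times slower the hierarchical clustering was than KMeans in each run. The timings are copied from the results above; the dictionary layout and the slowdown helper are illustrative names only.

```python
# Elapsed times in seconds, copied from the benchmark output above.
results = {
    1_000_000: {
        "kmeans_train": 28.179,
        "kmeans_predict": 0.011,
        "hier_train": 46.539,
        "hier_predict": 0.3076923076923077,
    },
    5_000_000: {
        "kmeans_train": 55.187,
        "kmeans_predict": 0.008,
        "hier_train": 210.238,
        "hier_predict": 0.3906093906093906,
    },
}


def slowdown(hierarchical_sec, kmeans_sec):
    """Ratio of hierarchical time to KMeans time (values > 1 mean slower)."""
    return hierarchical_sec / kmeans_sec


for rows, t in results.items():
    train = slowdown(t["hier_train"], t["kmeans_train"])
    predict = slowdown(t["hier_predict"], t["kmeans_predict"])
    print(f"rows={rows}: training {train:.2f}x, predicting {predict:.1f}x")
```

By this calculation, training was roughly 1.65 times slower at 1,000,000 rows and roughly 3.81 times slower at 5,000,000 rows, while predicting was roughly 28 and 49 times slower respectively.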



> Hierarchical Implementation of KMeans
> -------------------------------------
>
>                 Key: SPARK-2429
>                 URL: https://issues.apache.org/jira/browse/SPARK-2429
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: RJ Nowling
>            Assignee: Yu Ishikawa
>            Priority: Minor
>         Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The 
> Result of Benchmarking a Hierarchical Clustering.pdf, 
> benchmark-result.2014-10-29.html, benchmark2.html
>
>
> Hierarchical clustering algorithms are widely used and would make a nice 
> addition to MLlib.  Clustering algorithms are useful for determining 
> relationships between clusters as well as offering faster assignment. 
> Discussion on the dev list suggested the following possible approaches:
> * Top down, recursive application of KMeans
> * Reuse DecisionTree implementation with different objective function
> * Hierarchical SVD
> It was also suggested that support for distance metrics other than Euclidean, 
> such as negative dot or cosine, is necessary.
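
As a rough illustration of the distance metrics mentioned in the description, here is a minimal Python sketch of nearest-center assignment with a pluggable distance function. The function names and the nearest helper are hypothetical and are not part of MLlib or of the proposed implementation.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def negative_dot(a, b):
    """Negative dot product: a larger dot product means 'closer'."""
    return -sum(x * y for x, y in zip(a, b))


def cosine_distance(a, b):
    """One minus the cosine similarity; undefined for zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def nearest(point, centers, distance=euclidean):
    """Index of the center that minimizes the given distance function."""
    return min(range(len(centers)), key=lambda i: distance(point, centers[i]))


centers = [[1.0, 0.0], [0.0, 5.0]]
point = [3.0, 1.0]
for metric in (euclidean, negative_dot, cosine_distance):
    print(metric.__name__, nearest(point, centers, metric))
```

In this example the negative dot product picks a different center than the Euclidean and cosine distances, because it rewards centers with a large norm. This is one reason the choice of metric would need to be configurable rather than fixed.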



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
