[ 
https://issues.apache.org/jira/browse/SPARK-3504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152730#comment-14152730
 ] 

Patrick Wendell commented on SPARK-3504:
----------------------------------------

I just updated the title to make it more descriptive. Not an expert on this 
part of the code!

> KMeans optimization: track distances and unmoved cluster centers across 
> iterations
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-3504
>                 URL: https://issues.apache.org/jira/browse/SPARK-3504
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.0.2
>            Reporter: Derrick Burns
>
> The 1.0.2 implementation of the KMeans clusterer is VERY inefficient because 
> recomputes all distances to all cluster centers on each iteration.  In later 
> iterations of Lloyd's algorithm, points don't change clusters and clusters 
> don't move.  
> By 1) tracking which clusters move and 2) tracking for each point which 
> cluster it belongs to and the distance to that cluster, one can avoid 
> recomputing distances in many cases with very little increase in memory 
> requirements. 
> I implemented this new algorithm and the results were fantastic. Using 16 
> c3.8xlarge machines on EC2,  the clusterer converged in 13 iterations on 
> 1,714,654 (182 dimensional) points and 20,000 clusters in 24 minutes. Here 
> are the running times for the first 7 rounds:
> 6 minutes and 42 second
> 7 minutes and 7 seconds
> 7 minutes 13 seconds
> 1 minutes 18 seconds
> 30 seconds
> 18 seconds
> 12 seconds
> Without this improvement, all rounds would have taken roughly 7 minutes, 
> resulting in Lloyd's iterations taking  7 * 13 = 91 minutes. In other words, 
> this improvement resulting in a reduction of roughly 75% in running time with 
> no loss of accuracy.
> My implementation is a rewrite of the existing 1.0.2 implementation.  It is 
> not a simple modification of the existing implementation.  Please let me know 
> if you are interested in this new implementation.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to