[ https://issues.apache.org/jira/browse/SPARK-17389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15481490#comment-15481490 ]
Apache Spark commented on SPARK-17389:
--------------------------------------

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/15050

> KMeans speedup with better choice of k-means|| init steps = 2
> -------------------------------------------------------------
>
> Key: SPARK-17389
> URL: https://issues.apache.org/jira/browse/SPARK-17389
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Affects Versions: 2.0.0
> Reporter: Sean Owen
> Assignee: Sean Owen
> Priority: Minor
> Fix For: 2.1.0
>
>
> As reported in http://stackoverflow.com/questions/39260820/is-sparks-kmeans-unable-to-handle-bigdata#39260820, KMeans can be surprisingly slow, and most of the time is spent in kmeans|| initialization. Consider this simple example:
> {code}
> import org.apache.spark.mllib.random.RandomRDDs
> import org.apache.spark.mllib.clustering.KMeans
> val data = RandomRDDs.uniformVectorRDD(sc, 1000000, 64, sc.defaultParallelism).cache()
> data.count()
> new KMeans().setK(1000).setMaxIterations(5).run(data)
> {code}
> Init takes 5:54, and iterations take about 0:15 each, on my laptop. Init therefore takes about as long as 24 iterations, which is a typical run length, meaning half the time is spent just picking cluster centers. This seems excessive.
> There are a few ways to speed this up significantly. First, the implementation has an old "runs" parameter that is always 1 now. It used to allow multiple clusterings to be computed at once. The code can be simplified significantly now that runs=1 always. This is already covered by SPARK-11560, but a simple refactoring alone yields about a 13% init speedup, from 5:54 to 5:09 in this example. That's not what this change is about, though.
> By default, k-means|| makes 5 passes over the data. The original paper at http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf shows that 2 passes are plenty, certainly when l=2k as in our implementation. (See Figures 5.2/5.3; I believe the default of 5 was taken from Table 6, but that table does not suggest 5 is an optimal value.) Defaulting to 2 brings init down to 1:41 -- much improved over 5:54.
> Lastly, a small thing: the code performs a local k-means++ step to reduce the number of centers to k even when there are already <= k centers. This can be short-circuited. However, that is really the topic of SPARK-3261, because short-circuiting can cause fewer than k clusters to be returned in cases where that would actually be correct, too.
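For reference, the speedup described above does not require waiting for the default to change: the number of k-means|| passes is already exposed through the public setInitializationSteps setter. A minimal sketch, reusing the synthetic data from the example above and assuming a spark-shell session where `sc` is available:

{code}
import org.apache.spark.mllib.random.RandomRDDs
import org.apache.spark.mllib.clustering.KMeans

// Same synthetic data as in the example above: 1M points, 64 dimensions.
val data = RandomRDDs.uniformVectorRDD(sc, 1000000, 64, sc.defaultParallelism).cache()
data.count()

// Explicitly request 2 k-means|| init passes instead of the old default of 5.
// (Once this change lands, 2 is the default and this call becomes unnecessary.)
val model = new KMeans()
  .setK(1000)
  .setInitializationSteps(2)
  .setMaxIterations(5)
  .run(data)
{code}

On the numbers quoted above, this is the difference between roughly 5:54 and 1:41 of initialization time, with no change to the iteration phase.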