[ https://issues.apache.org/jira/browse/SPARK-15346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Abraham Zhan updated SPARK-15346:
---------------------------------
    Target Version/s: 2.0.0

> Reduce duplicate computation in picking initial points in LocalKMeans
> ---------------------------------------------------------------------
>
>                 Key: SPARK-15346
>                 URL: https://issues.apache.org/jira/browse/SPARK-15346
>             Project: Spark
>          Issue Type: Improvement
>         Environment: Ubuntu 14.04
>            Reporter: Abraham Zhan
>            Assignee: Abraham Zhan
>            Priority: Minor
>              Labels: performance
>             Fix For: 2.0.0
>
>
> h2.Main Issue
> I found that for KMeans|| in MLlib, when the dataset is large, the program
> hangs for a long time without terminating after the initial KMeans|| step
> finishes and before Lloyd's iterations begin. After testing, I see that it is
> stuck in LocalKMeans, and there is room for improvement in LocalKMeans.scala
> in MLlib. After each new initial center is picked, it is unnecessary to
> recompute the distances between all the points and the old centers, as below
> {code:scala}
> // squared distance from each point to the first center
> val costArray = points.map { point =>
>   KMeans.fastSquaredDistance(point, centers(0))
> }
> {code}
> Instead, we can keep the distance between each point and its closest center,
> compare it with the distance from that point to the new center, and update
> it if the new center is closer.
> h2.Test
> Download
> [LocalKMeans.zip|https://dl.dropboxusercontent.com/u/83207617/LocalKMeans.zip]
> The attachment "LocalKMeans.zip" contains the code "LocalKMeans2.scala" and
> the dataset "bigKMeansMedia".
> LocalKMeans2.scala contains both the original method KMeansPlusPlus and a
> modified version, KMeansPlusPlusModify (it fits best with spark.mllib-1.6.0).
> I added tests and a main function to it so that anyone can run the file
> directly.
> h3.How to Test
> Replace mllib.clustering.LocalKMeans.scala in your local repository with my
> LocalKMeans2.scala, or just put them in the same directory.
> Change the path in line 34 (loadAndRun()) to the path where you stored the
> data file bigKMeansMedia, which is also provided in the patch.
> Tune the 2nd and 3rd parameters in line 34 (loadAndRun()), which are the
> number of clusters K and the number of iterations, respectively.
> The console will then print the elapsed time and SE of each of the two
> versions of KMeans++.
> h2.Test Results
> This data was generated from a KMeans|| experiment in Spark; I added some
> internal functions to output and store the result of the KMeans||
> initialization.
> The first line of the file, with format "%d:%d:%d:%d", indicates "the
> seed:feature num:iteration num (in original KMeans||):points num" of the
> data.
> On my machine the experiment results are as follows:
> !https://cloud.githubusercontent.com/assets/10915169/15175957/6b21c3b0-179b-11e6-9741-66dfe4e23eb7.jpg!
> The x-axis is the number of clusters k, and the y-axis is the time in
> seconds.
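
The incremental update described under "Main Issue" can be sketched roughly as follows. This is a minimal, unweighted illustration under stated assumptions, not the code from LocalKMeans2.scala or from Spark itself: the object name IncrementalSeeding, the method pickCenters, the helper squaredDistance (a plain stand-in for KMeans.fastSquaredDistance), and the use of Array[Double] for points are all hypothetical names chosen for the example.

```scala
import scala.util.Random

// Hypothetical sketch: k-means++ style seeding that keeps a running cost array.
object IncrementalSeeding {

  // Plain squared Euclidean distance (stand-in for KMeans.fastSquaredDistance).
  def squaredDistance(a: Array[Double], b: Array[Double]): Double = {
    var sum = 0.0
    var i = 0
    while (i < a.length) {
      val d = a(i) - b(i)
      sum += d * d
      i += 1
    }
    sum
  }

  // Picks k centers and returns them together with costArray, where
  // costArray(i) is the squared distance from points(i) to its closest center.
  def pickCenters(
      points: Array[Array[Double]],
      k: Int,
      seed: Long): (Array[Array[Double]], Array[Double]) = {
    require(points.nonEmpty && k >= 1, "need at least one point and k >= 1")
    val rand = new Random(seed)
    val centers = new Array[Array[Double]](k)

    // First center: chosen uniformly at random (weights are ignored here).
    centers(0) = points(rand.nextInt(points.length))
    val costArray = points.map(p => squaredDistance(p, centers(0)))

    for (c <- 1 until k) {
      // Sample the next center with probability proportional to its current cost.
      val r = rand.nextDouble() * costArray.sum
      var cumulative = 0.0
      var j = 0
      while (j < points.length && cumulative < r) {
        cumulative += costArray(j)
        j += 1
      }
      centers(c) = points(math.max(j - 1, 0))

      // Incremental update: compare each stored cost with the distance to the
      // new center only, instead of revisiting all previously chosen centers.
      var i = 0
      while (i < points.length) {
        costArray(i) = math.min(costArray(i), squaredDistance(points(i), centers(c)))
        i += 1
      }
    }
    (centers, costArray)
  }
}
```

With this layout, adding a center costs one distance computation per point, whereas recomputing each point's cost against all centers chosen so far costs one computation per point per existing center. Calling IncrementalSeeding.pickCenters(points, k, seed) returns the chosen centers along with the final cost array, which makes it easy to check the running minimum against a brute-force recomputation.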