[ https://issues.apache.org/jira/browse/SPARK-3220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15140623#comment-15140623 ]
Tsai Li Ming commented on SPARK-3220:
-------------------------------------

I built Derrick's kmeans against Spark 1.6.0 and ran

{code}
import com.massivedatascience.clusterer.KMeans
val clusters = KMeans.train(parsedData, numClusters, numIterations)
{code}

It took 41 minutes with the same dataset and settings, compared to 1 hour using MLlib. In both cases, there was enough memory to cache everything.

> K-Means clusterer should perform K-Means initialization in parallel
> -------------------------------------------------------------------
>
>                 Key: SPARK-3220
>                 URL: https://issues.apache.org/jira/browse/SPARK-3220
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Derrick Burns
>              Labels: clustering
>
> The LocalKMeans method should be replaced with a parallel implementation. As
> it stands now, it becomes a bottleneck for large data sets.
> I have implemented this functionality in my version of the clusterer.
> However, I see that there are hundreds of outstanding pull requests. If
> someone on the team wants to sponsor the pull request, I will create one.
> Otherwise, I will just maintain my own private fork of the clusterer.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)