Re: kmeans|| in Spark is not real paralleled?

2015-04-03 Thread Xi Shen
Hi Xiangrui, I have created JIRA https://issues.apache.org/jira/browse/SPARK-6706, and attached the sample code. But I could not attach the test data. I will update the bug once I find a place to host the test data. Thanks, David On Tue, Mar 31, 2015 at 8:18 AM Xiangrui Meng men...@gmail.com

Re: kmeans|| in Spark is not real paralleled?

2015-03-30 Thread Xiangrui Meng
This PR updated the k-means|| initialization: https://github.com/apache/spark/commit/ca7910d6dd7693be2a675a0d6a6fcc9eb0aaeb5d, which was included in 1.3.0. It should fix k-means|| initialization with large k. Please create a JIRA for this issue and send me the code and the dataset to produce this

kmeans|| in Spark is not real paralleled?

2015-03-29 Thread Xi Shen
Hi, I have opened a couple of threads asking about a k-means performance problem in Spark. I think I have made a little progress. Previously I used the simplest way, KMeans.train(rdd, k, maxIterations). It uses the kmeans|| initialization algorithm, which is supposedly a faster version of kmeans++ and
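
For readers unfamiliar with the seeding step being discussed: k-means++ picks each new center with probability proportional to its squared distance from the nearest center chosen so far, which takes one pass over the data per center. k-means|| is designed to cut the number of passes by sampling many candidates per round. The following is only a minimal single-machine sketch of the sequential k-means++ seeding in plain Python, not Spark's code; the names `squared_distance` and `kmeans_pp_seed` are made up for illustration.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_seed(points, k, rng=None):
    """Pick k initial centers by sequential D^2 sampling (k-means++ style).

    Illustrative only: each loop iteration is one full pass over `points`,
    which is the cost that k-means|| tries to reduce.
    """
    rng = rng or random.Random(0)
    centers = [rng.choice(points)]
    # d2[i] is the squared distance from points[i] to its nearest center so far.
    d2 = [squared_distance(p, centers[0]) for p in points]
    while len(centers) < k:
        if sum(d2) == 0:
            # Every point coincides with a chosen center; fall back to uniform.
            centers.append(rng.choice(points))
        else:
            # Far-away points are more likely to become the next center.
            centers.append(rng.choices(points, weights=d2, k=1)[0])
        # Refresh nearest-center distances against the newly added center.
        d2 = [min(d, squared_distance(p, centers[-1])) for d, p in zip(d2, points)]
    return centers


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0), (21, 0)]
    print(kmeans_pp_seed(data, 3))
```

Because a chosen point has distance zero to itself, it gets weight zero and is not picked again while other distinct points remain.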