Hi Xiangrui, I have created JIRA https://issues.apache.org/jira/browse/SPARK-6706 and attached the sample code. But I could not attach the test data. I will update the bug once I find a place to host the test data.
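For reference, a minimal sketch of the comparison (not the exact attachment; it assumes the Spark 1.3 MLlib API, and the input path, k, and maxIterations are placeholder values):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    object KMeansInitComparison {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("kmeans-init-comparison"))

        // Parse whitespace-separated numeric vectors; the path is a placeholder.
        val data = sc.textFile("/path/to/test-data.txt")
          .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
          .cache()

        val k = 5000
        val maxIterations = 10

        // Default overload: uses "k-means||" initialization, where the long
        // single-threaded initialization phase shows up for large k.
        val parallelModel = KMeans.train(data, k, maxIterations)

        // Same training with random initialization (runs = 1); this variant
        // does not show the long initialization phase.
        val randomModel = KMeans.train(data, k, maxIterations, 1, KMeans.RANDOM)

        println(s"k-means|| cost: ${parallelModel.computeCost(data)}")
        println(s"random init cost: ${randomModel.computeCost(data)}")

        sc.stop()
      }
    }

The only difference between the two runs is the initialization mode, so the CPU gap should be attributable to the k-means|| initialization step.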
Thanks,
David

On Tue, Mar 31, 2015 at 8:18 AM Xiangrui Meng <men...@gmail.com> wrote:

> This PR updated the k-means|| initialization:
> https://github.com/apache/spark/commit/ca7910d6dd7693be2a675a0d6a6fcc9eb0aaeb5d,
> which was included in 1.3.0. It should fix kmeans|| initialization with
> large k. Please create a JIRA for this issue and send me the code and the
> dataset to reproduce this problem. Thanks! -Xiangrui
>
> On Sun, Mar 29, 2015 at 1:20 AM, Xi Shen <davidshe...@gmail.com> wrote:
>
>> Hi,
>>
>> I have opened a couple of threads asking about a k-means performance
>> problem in Spark. I think I have made a little progress.
>>
>> Previously I used the simplest form, KMeans.train(rdd, k, maxIterations).
>> It uses the "kmeans||" initialization algorithm, which is supposed to be a
>> faster version of kmeans++ and to give better results in general.
>>
>> But I observed that if k is very large, the initialization step takes a
>> long time. From the CPU utilization chart, it looks like only one thread
>> is working. Please see
>> https://stackoverflow.com/questions/29326433/cpu-gap-when-doing-k-means-with-spark.
>>
>> I read the paper, http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf,
>> and it points out that the kmeans++ initialization algorithm suffers when
>> k is large. That is why the paper contributed the kmeans|| algorithm.
>>
>> If I invoke KMeans.train with the random initialization algorithm, I do
>> not observe this problem, even with a very large k, like k=5000. This
>> makes me suspect that the kmeans|| implementation in Spark does not
>> actually run the initialization in parallel.
>>
>> I have also tested my code and data set with Spark 1.3.0, and I still
>> observe the problem. I quickly checked the PR regarding the KMeans
>> changes from 1.2.0 to 1.3.0; it seems to be only code improvement and
>> polish, not a change to the algorithm.
>>
>> I originally worked in a Windows 64-bit environment, and I have also
>> tested in a Linux 64-bit environment. I can provide the code and data set
>> if anyone wants to reproduce the problem.
>>
>> I hope a Spark developer could comment on this problem and help identify
>> whether it is a bug.
>>
>> Thanks,
>> Xi Shen
>> about.me/davidshen