This PR updated the k-means|| initialization: https://github.com/apache/spark/commit/ca7910d6dd7693be2a675a0d6a6fcc9eb0aaeb5d. It was included in 1.3.0 and should fix k-means|| initialization with large k. Please create a JIRA for this issue and send me the code and the dataset needed to reproduce the problem.

Thanks!
-Xiangrui

On Sun, Mar 29, 2015 at 1:20 AM, Xi Shen <davidshe...@gmail.com> wrote:
> Hi,
>
> I have opened a couple of threads asking about a k-means performance
> problem in Spark. I think I have made a little progress.
>
> Previously I used the simplest call, KMeans.train(rdd, k, maxIterations).
> It uses the "kmeans||" initialization algorithm, which is supposed to be a
> faster version of kmeans++ and to give better results in general.
>
> But I observed that if k is very large, the initialization step takes
> a long time. From the CPU utilization chart, it looks like only one thread
> is working. Please see
> https://stackoverflow.com/questions/29326433/cpu-gap-when-doing-k-means-with-spark
>
> I read the paper,
> http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf, and it points
> out that the kmeans++ initialization algorithm suffers when k is large.
> That is why the paper contributed the kmeans|| algorithm.
>
> If I invoke KMeans.train with the random initialization algorithm, I
> do not observe this problem, even with very large k, like k=5000. This
> makes me suspect that kmeans|| in Spark is not properly implemented and
> does not run in parallel.
>
> I have also tested my code and data set with Spark 1.3.0, and I still
> observe this problem. I quickly checked the PR regarding the KMeans
> algorithm change from 1.2.0 to 1.3.0. It seems to be only code improvement
> and polish, not a change to the algorithm.
>
> I originally worked in a Windows 64-bit environment, and I also tested in
> a Linux 64-bit environment. I can provide the code and data set if anyone
> wants to reproduce this problem.
>
> I hope a Spark developer can comment on this problem and help
> identify whether it is a bug.
>
> Thanks,
> Xi Shen
> http://about.me/davidshen
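
For readers following the thread, here is a rough, single-machine sketch of the k-means|| initialization described in the paper linked above. It is not Spark's implementation. The function names, the 5 sampling rounds, and the oversampling factor of 2k are assumptions chosen for illustration. The sketch shows the two phases: the sampling rounds, where each point is considered independently and the work can be spread across workers, and the final weighted k-means++ step over the candidates, which picks one center at a time and is inherently sequential. If that last step runs on a single machine, its cost grows with k, which might be related to the single busy thread reported above; that is a guess, not a diagnosis.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest_cost(p, centers):
    """Squared distance from p to its closest center."""
    return min(sq_dist(p, c) for c in centers)


def kmeans_pp(points, weights, k, rng):
    """Weighted k-means++ seeding: sequential, one new center per pass."""
    centers = [rng.choices(points, weights=weights, k=1)[0]]
    while len(centers) < k:
        costs = [w * nearest_cost(p, centers) for p, w in zip(points, weights)]
        if sum(costs) == 0:  # every remaining point coincides with a center
            break
        centers.append(rng.choices(points, weights=costs, k=1)[0])
    return centers


def kmeans_parallel_init(points, k, rounds=5, oversample=None, seed=0):
    """Sketch of k-means|| seeding; rounds and oversample are assumed defaults."""
    rng = random.Random(seed)
    ell = oversample if oversample is not None else 2 * k
    candidates = [rng.choice(points)]
    for _ in range(rounds):
        costs = [nearest_cost(p, candidates) for p in points]
        phi = sum(costs)
        if phi == 0:
            break
        # Each point is sampled independently of the others, so this pass
        # is the part that can be distributed.
        candidates += [p for p, c in zip(points, costs)
                       if rng.random() < ell * c / phi]
    # Weight each candidate by the number of points closest to it.
    weights = [0] * len(candidates)
    for p in points:
        j = min(range(len(candidates)), key=lambda i: sq_dist(p, candidates[i]))
        weights[j] += 1
    # Recluster the small weighted candidate set down to k centers.
    return kmeans_pp(candidates, weights, k, rng)


if __name__ == "__main__":
    data_rng = random.Random(42)
    data = [(data_rng.random(), data_rng.random()) for _ in range(500)]
    print(kmeans_parallel_init(data, k=10))
```

With oversampling 2k and 5 rounds, the candidate set holds on the order of 10k points, and the final step makes up to k passes over it, each comparing against the centers chosen so far. Timing this function for increasing k on a fixed data set is one way to see how that sequential portion scales.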