Hi Xiangrui, I have created JIRA https://issues.apache.org/jira/browse/SPARK-6706 and attached the sample code. But I could not attach the test data. I will update the bug once I find a place to host the test data.
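For reference, a minimal sketch of the comparison (not the exact attachment; it assumes the Spark 1.3 MLlib API, and the input path, k, and maxIterations are placeholder values):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    object KMeansInitComparison {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("kmeans-init-comparison"))

        // Parse whitespace-separated numeric vectors; the path is a placeholder.
        val data = sc.textFile("/path/to/test-data.txt")
          .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
          .cache()

        val k = 5000
        val maxIterations = 10

        // Default overload: uses "k-means||" initialization, where the long
        // single-threaded initialization phase shows up for large k.
        val parallelModel = KMeans.train(data, k, maxIterations)

        // Same training with random initialization (runs = 1); this variant
        // does not show the long initialization phase.
        val randomModel = KMeans.train(data, k, maxIterations, 1, KMeans.RANDOM)

        println(s"k-means|| cost: ${parallelModel.computeCost(data)}")
        println(s"random init cost: ${randomModel.computeCost(data)}")

        sc.stop()
      }
    }

The only difference between the two runs is the initialization mode, so the CPU gap should be attributable to the k-means|| initialization step.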
Thanks,
David

On Tue, Mar 31, 2015 at 8:18 AM Xiangrui Meng <men...@gmail.com> wrote:

> This PR updated the k-means|| initialization:
> https://github.com/apache/spark/commit/ca7910d6dd7693be2a675a0d6a6fcc9eb0aaeb5d,
> which was included in 1.3.0. It should fix kmeans|| initialization with
> large k. Please create a JIRA for this issue and send me the code and the
> dataset to reproduce this problem. Thanks! -Xiangrui
>
> On Sun, Mar 29, 2015 at 1:20 AM, Xi Shen <davidshe...@gmail.com> wrote:
>
>> Hi,
>>
>> I have opened a couple of threads asking about a k-means performance
>> problem in Spark. I think I have made a little progress.
>>
>> Previously I used the simplest form, KMeans.train(rdd, k, maxIterations).
>> It uses the "kmeans||" initialization algorithm, which is supposed to be a
>> faster version of kmeans++ and to give better results in general.
>>
>> But I observed that if k is very large, the initialization step takes a
>> long time. From the CPU utilization chart, it looks like only one thread
>> is working. Please see
>> https://stackoverflow.com/questions/29326433/cpu-gap-when-doing-k-means-with-spark.
>>
>> I read the paper, http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf,
>> and it points out that the kmeans++ initialization algorithm suffers when
>> k is large. That is why the paper contributed the kmeans|| algorithm.
>>
>> If I invoke KMeans.train with the random initialization algorithm, I do
>> not observe this problem, even with a very large k, like k=5000. This
>> makes me suspect that the kmeans|| implementation in Spark does not
>> actually run the initialization in parallel.
>>
>> I have also tested my code and data set with Spark 1.3.0, and I still
>> observe the problem. I quickly checked the PR regarding the KMeans
>> changes from 1.2.0 to 1.3.0; it seems to be only code improvement and
>> polish, not a change to the algorithm.
>>
>> I originally worked in a Windows 64-bit environment, and I have also
>> tested in a Linux 64-bit environment. I can provide the code and data set
>> if anyone wants to reproduce the problem.
>>
>> I hope a Spark developer could comment on this problem and help identify
>> whether it is a bug.
>>
>> Thanks,
>> Xi Shen
>> about.me/davidshen