Hi Tsai,
Could you share more information about the machine you used and the
training parameters (runs, k, and iterations)? It can help solve your
issues. Thanks!
Best,
Xiangrui
On Sun, Mar 23, 2014 at 3:15 AM, Tsai Li Ming mailingl...@ltsai.com wrote:
Hi,
At the reduceByKey stage, it takes
Hi,
This is on a 4-node cluster, each node with 32 cores and 256GB of RAM.
Spark (0.9.0) is deployed in standalone mode.
Each worker is configured with 192GB, and the Spark executor memory is also 192GB.
This is on the first iteration. K=500,000. Here's the code I use:
http://pastebin.com/2yXL3y8i , which is a
K = 500,000 is certainly a large number for k-means. If there is no
particular reason to have 500,000 clusters, could you try to reduce it
to, e.g., 100 or 1000? Also, the example code is not for large-scale
problems. You should use the KMeans implementation in MLlib clustering for
your problem.
Thanks. Let me try with a smaller K.
Does the size of the input data matter for the example? Currently I have 50M
rows. What is a reasonable size to demonstrate the capability of Spark?
On 24 Mar, 2014, at 3:38 pm, Xiangrui Meng men...@gmail.com wrote:
K = 500,000 is certainly a large
Number of rows doesn't matter much as long as you have enough workers
to distribute the work. K-means has complexity O(n * d * k), where n
is the number of points, d is the dimension, and k is the number of
clusters. If you use the KMeans implementation from MLlib, the
initialization stage is done on master,
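The O(n * d * k) cost per iteration that Xiangrui describes can be seen in a plain-Python sketch of one Lloyd's assignment step. This is a toy illustration, not the MLlib implementation; the data sizes and function name are invented for the example:

```python
# One k-means assignment step over toy data, counting the
# per-dimension operations spent on squared distances.
# (Illustrative only; not Spark/MLlib code.)

def assign_step(points, centers):
    """Assign each point to its nearest center; return labels and
    the number of per-dimension distance operations performed."""
    ops = 0
    labels = []
    for p in points:                      # n points
        best, best_d = None, float("inf")
        for ci, c in enumerate(centers):  # k centers
            d = 0.0
            for pj, cj in zip(p, c):      # d dimensions
                d += (pj - cj) ** 2
                ops += 1
            if d < best_d:
                best, best_d = ci, d
        labels.append(best)
    return labels, ops

n, d, k = 100, 3, 5
points = [[float(i % 7), float(i % 11), float(i % 13)] for i in range(n)]
centers = [list(p) for p in points[:k]]
labels, ops = assign_step(points, centers)
print(ops)  # n * d * k = 100 * 3 * 5 = 1500
```

The operation count grows linearly in each of n, d, and k, which is why cutting k from 500,000 down to 100 or 1000 shrinks the per-iteration work so dramatically.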
Sorry, I meant the master branch of https://github.com/apache/spark. -Xiangrui
On Mon, Mar 24, 2014 at 6:27 PM, Tsai Li Ming mailingl...@ltsai.com wrote:
Thanks again.
If you use the KMeans implementation from MLlib, the
initialization stage is done on master,
The master here is the