The number of rows doesn't matter much as long as you have enough workers to distribute the work. K-means has complexity O(n * d * k), where n is the number of points, d is the dimension, and k is the number of clusters. If you use the KMeans implementation from MLlib, the initialization stage runs on the master, so a large k slows initialization down. If your data is sparse, the latest change to KMeans will help with speed, depending on how sparse your data is. -Xiangrui
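[Editor's note: for illustration, a minimal sketch of the MLlib route suggested in this thread. It assumes the Vector-based KMeans API that came with the sparse-input change (org.apache.spark.mllib.linalg.Vectors); the input path, k, and iteration count below are placeholders, not values from the thread.]

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    object MLlibKMeansSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("mllib-kmeans-sketch"))

        // Placeholder path; parse whitespace-delimited features into MLlib vectors.
        // For sparse input, build Vectors.sparse(size, indices, values) here instead
        // of Vectors.dense to benefit from the sparse support mentioned above.
        val data = sc.textFile("/data/kmeans_input.txt")
          .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
          .cache()

        // Placeholder k and maxIterations; KMeans.train runs the MLlib
        // implementation rather than the copy-and-paste example script.
        val model = KMeans.train(data, 1000, 20)
        println("Within-set sum of squared errors: " + model.computeCost(data))

        sc.stop()
      }
    }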
On Mon, Mar 24, 2014 at 12:44 AM, Tsai Li Ming <mailingl...@ltsai.com> wrote:
> Thanks. Let me try with a smaller K.
>
> Does the size of the input data matter for the example? Currently I have
> 50M rows. What is a reasonable size to demonstrate the capability of Spark?
>
> On 24 Mar, 2014, at 3:38 pm, Xiangrui Meng <men...@gmail.com> wrote:
>
>> K = 500000 is certainly a large number for k-means. If there is no
>> particular reason to have 500000 clusters, could you try to reduce it
>> to, e.g., 100 or 1000? Also, the example code is not intended for
>> large-scale problems. You should use the KMeans algorithm in
>> mllib.clustering for your problem.
>>
>> -Xiangrui
>>
>> On Sun, Mar 23, 2014 at 11:53 PM, Tsai Li Ming <mailingl...@ltsai.com> wrote:
>>> Hi,
>>>
>>> This is on a 4-node cluster, each node with 32 cores/256GB RAM.
>>>
>>> Spark (0.9.0) is deployed in standalone mode.
>>>
>>> Each worker is configured with 192GB. Spark executor memory is also 192GB.
>>>
>>> This is on the first iteration, with K=500000. Here's the code I use:
>>> http://pastebin.com/2yXL3y8i , which is a copy-and-paste of the example.
>>>
>>> Thanks!
>>>
>>> On 24 Mar, 2014, at 2:46 pm, Xiangrui Meng <men...@gmail.com> wrote:
>>>
>>>> Hi Tsai,
>>>>
>>>> Could you share more information about the machine you used and the
>>>> training parameters (runs, k, and iterations)? It can help solve your
>>>> issue. Thanks!
>>>>
>>>> Best,
>>>> Xiangrui
>>>>
>>>> On Sun, Mar 23, 2014 at 3:15 AM, Tsai Li Ming <mailingl...@ltsai.com> wrote:
>>>>> Hi,
>>>>>
>>>>> At the reduceByKey stage, it takes a few minutes before the tasks start
>>>>> working.
>>>>>
>>>>> I have -Dspark.default.parallelism=127 (total cores - 1).
>>>>>
>>>>> CPU/network/IO are idle across all nodes while this is happening,
>>>>> and there is nothing notable in the master log file. From the spark-shell:
>>>>>
>>>>> 14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:124 as TID 538 on executor 2: XXX (PROCESS_LOCAL)
>>>>> 14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:124 as 38765155 bytes in 193 ms
>>>>> 14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:125 as TID 539 on executor 1: XXX (PROCESS_LOCAL)
>>>>> 14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:125 as 38765155 bytes in 96 ms
>>>>> 14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:126 as TID 540 on executor 0: XXX (PROCESS_LOCAL)
>>>>> 14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:126 as 38765155 bytes in 100 ms
>>>>>
>>>>> But it stops there for a significant time before any movement.
>>>>>
>>>>> In the stage detail of the UI, I can see that there are 127 tasks running
>>>>> but the duration of each is at least a few minutes.
>>>>>
>>>>> I'm working off local storage (not HDFS) and the k-means data is about
>>>>> 6.5GB (50M rows).
>>>>>
>>>>> Is this normal behaviour?
>>>>>
>>>>> Thanks!
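[Editor's note: a short sketch of how the settings quoted in this thread (spark.default.parallelism=127, 192GB executor memory) could be set through SparkConf rather than JVM -D flags. The property names are real Spark configs; the app name and input path are placeholders.]

    import org.apache.spark.{SparkConf, SparkContext}

    // Equivalent of the -Dspark.default.parallelism=127 flag and the 192GB
    // executor memory quoted in the thread, set through SparkConf instead.
    val conf = new SparkConf()
      .setAppName("kmeans-benchmark")            // placeholder name
      .set("spark.default.parallelism", "127")
      .set("spark.executor.memory", "192g")
    val sc = new SparkContext(conf)

    // Repartitioning explicitly pins the task count for the shuffle behind
    // reduceByKey, independent of the input file's block layout.
    val points = sc.textFile("/data/kmeans_input.txt").repartition(127)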