Thanks, let me try with a smaller K. Does the size of the input data matter for the example? I currently have 50M rows. What is a reasonable size to demonstrate the capability of Spark?
On 24 Mar, 2014, at 3:38 pm, Xiangrui Meng <men...@gmail.com> wrote:

> K = 500000 is certainly a large number for k-means. If there is no
> particular reason to have 500000 clusters, could you try to reduce it
> to, e.g., 100 or 1000? Also, the example code is not for large-scale
> problems. You should use the KMeans algorithm in mllib clustering for
> your problem.
>
> -Xiangrui
>
> On Sun, Mar 23, 2014 at 11:53 PM, Tsai Li Ming <mailingl...@ltsai.com> wrote:
>> Hi,
>>
>> This is a 4-node cluster, each node with 32 cores/256GB RAM.
>>
>> Spark (0.9.0) is deployed in standalone mode.
>>
>> Each worker is configured with 192GB. Spark executor memory is also 192GB.
>>
>> This is on the first iteration, with K=500000. Here's the code I use:
>> http://pastebin.com/2yXL3y8i , which is a copy-and-paste of the example.
>>
>> Thanks!
>>
>> On 24 Mar, 2014, at 2:46 pm, Xiangrui Meng <men...@gmail.com> wrote:
>>
>>> Hi Tsai,
>>>
>>> Could you share more information about the machine you used and the
>>> training parameters (runs, k, and iterations)? That would help in
>>> solving your issue. Thanks!
>>>
>>> Best,
>>> Xiangrui
>>>
>>> On Sun, Mar 23, 2014 at 3:15 AM, Tsai Li Ming <mailingl...@ltsai.com> wrote:
>>>> Hi,
>>>>
>>>> At the reduceByKey stage, it takes a few minutes before the tasks
>>>> start working.
>>>>
>>>> I have -Dspark.default.parallelism=127 cores (n-1).
>>>>
>>>> CPU/network/IO is idle across all nodes while this is happening,
>>>> and there is nothing in particular in the master log file.
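Xiangrui's suggestion to shrink K matters because each Lloyd's k-means iteration compares every point against every center, so the per-iteration work grows as O(n·k·d). Here is a back-of-envelope sketch of the difference; the feature count d = 10 is an assumption for illustration, since the thread never states the dimensionality:

```python
# Rough per-iteration cost of Lloyd's k-means: each of the n points is
# compared against all k centers across d dimensions, i.e. O(n*k*d)
# distance terms. d = 10 is assumed (not stated in the thread).

def distance_terms(n_rows, k, dims):
    """Number of point/center coordinate comparisons per iteration."""
    return n_rows * k * dims

n = 50_000_000   # 50M rows, from the thread
d = 10           # assumed feature count

huge_k = distance_terms(n, 500_000, d)  # the K used in the original run
small_k = distance_terms(n, 1_000, d)   # a K in the range Xiangrui suggested

print(f"K=500000: {huge_k:.1e} terms per iteration")
print(f"K=1000:   {small_k:.1e} terms per iteration")
print(f"reduction: {huge_k // small_k}x")
```

With these assumed numbers, dropping K from 500000 to 1000 cuts the per-iteration work by a factor of 500, independent of d.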
>>>> From the spark-shell:
>>>>
>>>> 14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:124 as TID 538 on
>>>> executor 2: XXX (PROCESS_LOCAL)
>>>> 14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:124 as 38765155
>>>> bytes in 193 ms
>>>> 14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:125 as TID 539 on
>>>> executor 1: XXX (PROCESS_LOCAL)
>>>> 14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:125 as 38765155
>>>> bytes in 96 ms
>>>> 14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:126 as TID 540 on
>>>> executor 0: XXX (PROCESS_LOCAL)
>>>> 14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:126 as 38765155
>>>> bytes in 100 ms
>>>>
>>>> But it stops there for a significant time before any movement.
>>>>
>>>> In the stage detail of the UI, I can see that there are 127 tasks running,
>>>> but the duration of each is at least a few minutes.
>>>>
>>>> I'm working off local storage (not HDFS) and the k-means data is about
>>>> 6.5GB (50M rows).
>>>>
>>>> Is this normal behaviour?
>>>>
>>>> Thanks!
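The ~38 MB serialized task size in the log above is consistent with every task carrying the full array of K cluster centers in its closure, which at K = 500000 dominates the task payload. A quick sanity check on that arithmetic (a sketch only; the per-row feature count is an assumption, since the thread never states the dimensionality):

```python
# Sanity-check the 38765155-byte serialized task size against the idea
# that each task ships all K cluster centers as 8-byte doubles.
# The feature count (9) is assumed, not stated in the thread.

K = 500_000
observed_task_bytes = 38_765_155  # from the TaskSetManager log lines

# If the payload were only the centers, how many doubles per center?
implied_doubles_per_center = observed_task_bytes / (K * 8)
print(f"~{implied_doubles_per_center:.1f} doubles per center")

# With an assumed 9 features per center stored as 8-byte doubles:
estimated_bytes = K * 9 * 8
print(f"estimated center payload: {estimated_bytes / 1e6:.1f} MB "
      f"vs observed {observed_task_bytes / 1e6:.1f} MB")
```

The estimate (36 MB of raw center data, plus serialization overhead) lands close to the observed 38.8 MB, which would explain both the slow task serialization and the minutes-long startup: each of the 127 tasks has to receive tens of megabytes before it can begin.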