Re: Kmeans example reduceByKey slow
Sorry, I meant the master branch of https://github.com/apache/spark.

-Xiangrui

On Mon, Mar 24, 2014 at 6:27 PM, Tsai Li Ming wrote:
> Thanks again.
>
>> If you use the KMeans implementation from MLlib, the
>> initialization stage is done on master,
>
> The "master" here is the app/driver/spark-shell?
>
> Thanks!
Re: Kmeans example reduceByKey slow
Thanks again.

On 25 Mar, 2014, at 1:03 am, Xiangrui Meng wrote:
> If you use the KMeans implementation from MLlib, the
> initialization stage is done on master, so a large k would slow down
> the initialization stage.

The "master" here is the app/driver/spark-shell?

Thanks!
Re: Kmeans example reduceByKey slow
Number of rows doesn't matter much as long as you have enough workers
to distribute the work. K-means has complexity O(n * d * k) per
iteration, where n is the number of points, d is the dimension, and k is
the number of clusters. If you use the KMeans implementation from MLlib,
the initialization stage is done on the master, so a large k would slow
down the initialization stage. If your data is sparse, the latest change
to KMeans will help with the speed, depending on how sparse your data is.

-Xiangrui

On Mon, Mar 24, 2014 at 12:44 AM, Tsai Li Ming wrote:
> Thanks, let me try with a smaller K.
>
> Does the size of the input data matter for the example? Currently I have
> 50M rows. What is a reasonable size to demonstrate the capability of Spark?
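The O(n * d * k) per-iteration cost quoted above can be made concrete with a minimal plain-Python sketch of one Lloyd iteration (illustrative toy data only; no Spark involved, and MLlib's actual implementation is distributed and far more optimized):

```python
def lloyd_iteration(points, centers):
    """One k-means (Lloyd) iteration. For n points, k centers and d
    dimensions, the assignment step below does O(n * k * d) work,
    which is the per-iteration cost discussed in the thread."""
    k, d = len(centers), len(centers[0])
    sums = [[0.0] * d for _ in range(k)]   # per-center coordinate sums
    counts = [0] * k
    for p in points:                        # n points
        # squared distance to every center: k * d work per point
        best = min(range(k),
                   key=lambda j: sum((p[i] - centers[j][i]) ** 2
                                     for i in range(d)))
        counts[best] += 1
        for i in range(d):
            sums[best][i] += p[i]
    # new center = mean of assigned points; keep the old center if a
    # cluster received no points
    return [[s / counts[j] for s in sums[j]] if counts[j] else centers[j]
            for j in range(k)]

pts = [[0.0, 0.0], [1.0, 1.0], [9.0, 9.0], [10.0, 10.0]]
print(lloyd_iteration(pts, [[0.0, 0.0], [10.0, 10.0]]))
# -> [[0.5, 0.5], [9.5, 9.5]]
```

Doubling either n, d, or k doubles the work of the assignment loop, which is why a large k slows every iteration, not just initialization.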
Re: Kmeans example reduceByKey slow
Thanks, let me try with a smaller K.

Does the size of the input data matter for the example? Currently I have
50M rows. What is a reasonable size to demonstrate the capability of Spark?
Re: Kmeans example reduceByKey slow
K = 50 is certainly a large number for k-means. If there is no
particular reason to have 50 clusters, could you try to reduce it to,
e.g., 100 or 1000? Also, the example code is not for large-scale
problems. You should use the KMeans implementation in MLlib's clustering
package for your problem.

-Xiangrui
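As noted earlier in the thread, MLlib's initialization runs on the master, so seeding cost matters for large k. As a rough illustration of why, here is a plain-Python sketch of k-means++-style seeding (an illustrative stand-in: MLlib actually uses the related "k-means||" scheme, and this toy code is not MLlib's implementation):

```python
import random

def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_pp_init(points, k, rng):
    """k-means++-style seeding: each new center is drawn with probability
    proportional to its squared distance from the nearest existing center.
    Every one of the k rounds rescans all n points, so driver-side
    seeding cost grows linearly with k."""
    centers = [rng.choice(points)]
    # d2[i] = squared distance from points[i] to its nearest chosen center
    d2 = [dist2(p, centers[0]) for p in points]
    while len(centers) < k:
        r = rng.random() * sum(d2)
        acc = 0.0
        for i, w in enumerate(d2):
            acc += w
            if acc >= r:
                centers.append(points[i])
                break
        else:
            centers.append(points[-1])  # floating-point edge case
        d2 = [min(old, dist2(p, centers[-1])) for old, p in zip(d2, points)]
    return centers

rng = random.Random(0)
pts = [[rng.random(), rng.random()] for _ in range(500)]
seeds = kmeans_pp_init(pts, 50, rng)  # k=50 means 50 full scans of the data
```

With this rule the seeding alone costs O(n * k * d), all on one machine, which matches the observation that a large k slows the initialization stage.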
Re: Kmeans example reduceByKey slow
Hi,

This is on a 4-node cluster, each node with 32 cores/256GB RAM.

Spark (0.9.0) is deployed in standalone mode.

Each worker is configured with 192GB. Spark executor memory is also 192GB.

This is on the first iteration. K=50. Here's the code I use:
http://pastebin.com/2yXL3y8i , which is a copy-and-paste of the example.

Thanks!
Re: Kmeans example reduceByKey slow
Hi Tsai,

Could you share more information about the machine you used and the
training parameters (runs, k, and iterations)? That would help in
diagnosing the issue. Thanks!

Best,
Xiangrui
Kmeans example reduceByKey slow
Hi,

At the reduceByKey stage, it takes a few minutes before the tasks start
working.

I have set -Dspark.default.parallelism=127 (total cores - 1).

CPU/network/IO is idle across all nodes while this is happening.

And there is nothing particular in the master log file. From the
spark-shell:

14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:124 as TID 538 on executor 2: XXX (PROCESS_LOCAL)
14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:124 as 38765155 bytes in 193 ms
14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:125 as TID 539 on executor 1: XXX (PROCESS_LOCAL)
14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:125 as 38765155 bytes in 96 ms
14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:126 as TID 540 on executor 0: XXX (PROCESS_LOCAL)
14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:126 as 38765155 bytes in 100 ms

But it stops there for some significant time before any movement.

In the stage detail of the UI, I can see that there are 127 tasks
running, but the duration of each is at least a few minutes.

I'm working off local storage (not HDFS) and the kmeans data is about
6.5GB (50M rows).

Is this normal behaviour?

Thanks!
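For context, the work the example's reduceByKey stage performs each iteration — keying every point by its nearest center and summing (point, count) pairs per key — can be emulated in plain Python (an illustrative single-machine sketch, not the Spark example itself, which distributes this across partitions):

```python
def closest(p, centers):
    """Index of the center nearest to point p (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))

def reduce_by_key_step(points, centers):
    """Emulate map(p -> (closest, (p, 1))) followed by reduceByKey that
    sums the coordinate vectors and counts, then take the mean per key."""
    acc = {}  # center index -> (coordinate sums, count)
    for p in points:
        j = closest(p, centers)
        sums, cnt = acc.get(j, ([0.0] * len(p), 0))
        acc[j] = ([s + x for s, x in zip(sums, p)], cnt + 1)
    # new center for each key = mean of the points assigned to it
    return {j: [s / cnt for s in sums] for j, (sums, cnt) in acc.items()}

new_centers = reduce_by_key_step(
    [[0.0, 0.0], [1.0, 1.0], [9.0, 9.0], [10.0, 10.0]],
    [[0.0, 0.0], [10.0, 10.0]])
# -> {0: [0.5, 0.5], 1: [9.5, 9.5]}
```

One hedged observation on the log above: each task serializes to roughly 38 MB, which suggests the task closure being shipped to executors (including the current centers array) is large; shipping that for all 127 tasks could by itself account for a slow start to the stage.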