Re: Kmeans example reduceByKey slow
Sorry, I meant the master branch of https://github.com/apache/spark.

-Xiangrui

On Mon, Mar 24, 2014 at 6:27 PM, Tsai Li Ming wrote:
> Thanks again.
>
>> If you use the KMeans implementation from MLlib, the
>> initialization stage is done on master,
>
> The "master" here is the app/driver/spark-shell?
>
> Thanks!
Re: Kmeans example reduceByKey slow
Thanks again.

On 25 Mar, 2014, at 1:03 am, Xiangrui Meng wrote:
> If you use the KMeans implementation from MLlib, the
> initialization stage is done on master, so a large k would slow down
> the initialization stage.

The "master" here is the app/driver/spark-shell?

Thanks!
Re: Kmeans example reduceByKey slow
Number of rows doesn't matter much as long as you have enough workers
to distribute the work. K-means has complexity O(n * d * k) per
iteration, where n is the number of points, d is the dimension, and k is
the number of clusters. If you use the KMeans implementation from MLlib,
the initialization stage is done on the master, so a large k would slow
down the initialization stage. If your data is sparse, the latest change
to KMeans will help with the speed, depending on how sparse your data is.

-Xiangrui

On Mon, Mar 24, 2014 at 12:44 AM, Tsai Li Ming wrote:
> Thanks, let me try with a smaller K.
>
> Does the size of the input data matter for the example? Currently I have
> 50M rows. What is a reasonable size to demonstrate the capability of Spark?
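The O(n * d * k) per-iteration cost quoted above can be made concrete with a minimal plain-Python sketch of one Lloyd iteration (illustrative toy data only; no Spark involved, and MLlib's actual implementation is distributed and far more optimized):

```python
def lloyd_iteration(points, centers):
    """One k-means (Lloyd) iteration. For n points, k centers and d
    dimensions, the assignment step below does O(n * k * d) work,
    which is the per-iteration cost discussed in the thread."""
    k, d = len(centers), len(centers[0])
    sums = [[0.0] * d for _ in range(k)]   # per-center coordinate sums
    counts = [0] * k
    for p in points:                        # n points
        # squared distance to every center: k * d work per point
        best = min(range(k),
                   key=lambda j: sum((p[i] - centers[j][i]) ** 2
                                     for i in range(d)))
        counts[best] += 1
        for i in range(d):
            sums[best][i] += p[i]
    # new center = mean of assigned points; keep the old center if a
    # cluster received no points
    return [[s / counts[j] for s in sums[j]] if counts[j] else centers[j]
            for j in range(k)]

pts = [[0.0, 0.0], [1.0, 1.0], [9.0, 9.0], [10.0, 10.0]]
print(lloyd_iteration(pts, [[0.0, 0.0], [10.0, 10.0]]))
# -> [[0.5, 0.5], [9.5, 9.5]]
```

Doubling either n, d, or k doubles the work of the assignment loop, which is why a large k slows every iteration, not just initialization.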
Re: Kmeans example reduceByKey slow
Thanks, let me try with a smaller K.

Does the size of the input data matter for the example? Currently I have
50M rows. What is a reasonable size to demonstrate the capability of Spark?
Re: Kmeans example reduceByKey slow
K = 50 is certainly a large number for k-means. If there is no
particular reason to have 50 clusters, could you try to reduce it to,
e.g., 100 or 1000? Also, the example code is not for large-scale
problems. You should use the KMeans implementation in MLlib's clustering
package for your problem.

-Xiangrui
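As noted earlier in the thread, MLlib's initialization runs on the master, so seeding cost matters for large k. As a rough illustration of why, here is a plain-Python sketch of k-means++-style seeding (an illustrative stand-in: MLlib actually uses the related "k-means||" scheme, and this toy code is not MLlib's implementation):

```python
import random

def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_pp_init(points, k, rng):
    """k-means++-style seeding: each new center is drawn with probability
    proportional to its squared distance from the nearest existing center.
    Every one of the k rounds rescans all n points, so driver-side
    seeding cost grows linearly with k."""
    centers = [rng.choice(points)]
    # d2[i] = squared distance from points[i] to its nearest chosen center
    d2 = [dist2(p, centers[0]) for p in points]
    while len(centers) < k:
        r = rng.random() * sum(d2)
        acc = 0.0
        for i, w in enumerate(d2):
            acc += w
            if acc >= r:
                centers.append(points[i])
                break
        else:
            centers.append(points[-1])  # floating-point edge case
        d2 = [min(old, dist2(p, centers[-1])) for old, p in zip(d2, points)]
    return centers

rng = random.Random(0)
pts = [[rng.random(), rng.random()] for _ in range(500)]
seeds = kmeans_pp_init(pts, 50, rng)  # k=50 means 50 full scans of the data
```

With this rule the seeding alone costs O(n * k * d), all on one machine, which matches the observation that a large k slows the initialization stage.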
Re: Kmeans example reduceByKey slow
Hi,

This is on a 4-node cluster, each node with 32 cores/256GB RAM.

Spark (0.9.0) is deployed in standalone mode.

Each worker is configured with 192GB. Spark executor memory is also 192GB.

This is on the first iteration. K=50. Here's the code I use:
http://pastebin.com/2yXL3y8i , which is a copy-and-paste of the example.

Thanks!
Re: Kmeans example reduceByKey slow
Hi Tsai,

Could you share more information about the machine you used and the
training parameters (runs, k, and iterations)? That would help in
diagnosing the issue. Thanks!

Best,
Xiangrui
Kmeans example reduceByKey slow
Hi,

At the reduceByKey stage, it takes a few minutes before the tasks start
working.

I have set -Dspark.default.parallelism=127 (total cores - 1).

CPU/network/IO is idle across all nodes while this is happening.

And there is nothing particular in the master log file. From the
spark-shell:

14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:124 as TID 538 on executor 2: XXX (PROCESS_LOCAL)
14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:124 as 38765155 bytes in 193 ms
14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:125 as TID 539 on executor 1: XXX (PROCESS_LOCAL)
14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:125 as 38765155 bytes in 96 ms
14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:126 as TID 540 on executor 0: XXX (PROCESS_LOCAL)
14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:126 as 38765155 bytes in 100 ms

But it stops there for some significant time before any movement.

In the stage detail of the UI, I can see that there are 127 tasks
running, but the duration of each is at least a few minutes.

I'm working off local storage (not HDFS) and the kmeans data is about
6.5GB (50M rows).

Is this normal behaviour?

Thanks!
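For context, the work the example's reduceByKey stage performs each iteration — keying every point by its nearest center and summing (point, count) pairs per key — can be emulated in plain Python (an illustrative single-machine sketch, not the Spark example itself, which distributes this across partitions):

```python
def closest(p, centers):
    """Index of the center nearest to point p (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))

def reduce_by_key_step(points, centers):
    """Emulate map(p -> (closest, (p, 1))) followed by reduceByKey that
    sums the coordinate vectors and counts, then take the mean per key."""
    acc = {}  # center index -> (coordinate sums, count)
    for p in points:
        j = closest(p, centers)
        sums, cnt = acc.get(j, ([0.0] * len(p), 0))
        acc[j] = ([s + x for s, x in zip(sums, p)], cnt + 1)
    # new center for each key = mean of the points assigned to it
    return {j: [s / cnt for s in sums] for j, (sums, cnt) in acc.items()}

new_centers = reduce_by_key_step(
    [[0.0, 0.0], [1.0, 1.0], [9.0, 9.0], [10.0, 10.0]],
    [[0.0, 0.0], [10.0, 10.0]])
# -> {0: [0.5, 0.5], 1: [9.5, 9.5]}
```

One hedged observation on the log above: each task serializes to roughly 38 MB, which suggests the task closure being shipped to executors (including the current centers array) is large; shipping that for all 127 tasks could by itself account for a slow start to the stage.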