Re: high GC in the Kmeans algorithm

2015-02-20 Thread Xiangrui Meng
A single vector of size 10^7 won't hit that bound. How many clusters did you set? The broadcast variable size is 10^7 * k and you can calculate the amount of memory it needs. Try to reduce the number of tasks and see whether it helps. -Xiangrui On Tue, Feb 17, 2015 at 7:20 PM, lihu
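
The sizing suggested above can be sketched as a quick back-of-the-envelope calculation. This is only a rough estimate, assuming each cluster center is a dense vector of 8-byte doubles; the function name, the candidate values of k, and the 2 GiB constant are illustrative assumptions, and the way Spark actually serializes or chunks a broadcast is not modeled.

```python
# Rough estimate of the memory needed for the broadcast cluster centers.
# Assumption: each center is a dense vector of 8-byte doubles.
BYTES_PER_DOUBLE = 8
TWO_GIB = 2 * 1024**3  # the "2GB" bound mentioned in the thread, taken as 2 GiB


def centers_bytes(dim: int, k: int) -> int:
    """Bytes needed to hold k dense centers of dimension dim."""
    return dim * k * BYTES_PER_DOUBLE


dim = 10**7

# A single vector of 10^7 doubles is far below the 2GB bound.
single_vector = centers_bytes(dim, 1)
print(f"one vector: {single_vector / 1e6:.0f} MB, under bound: {single_vector < TWO_GIB}")

# The total for all centers grows linearly with k (hypothetical values of k).
for k in (10, 100, 1000):
    print(f"k={k}: {centers_bytes(dim, k) / 1e9:.1f} GB")
```

Under these assumptions one center takes about 80 MB, so the total depends almost entirely on k.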

Re: high GC in the Kmeans algorithm

2015-02-17 Thread lihu
Thanks for your answer. Yes, I cached the data; I can see in the web UI that all of it is cached in memory. What worries me is the dimension, not the total size. Sean Owen once told me that a broadcast supports a maximum array size of 2GB, so is 10^7 a little too large? On

Re: high GC in the Kmeans algorithm

2015-02-17 Thread Xiangrui Meng
Did you cache the data? Was it fully cached? The k-means implementation doesn't create many temporary objects. I guess you need more RAM to keep GC from being triggered so frequently. Please monitor the memory usage using YourKit or VisualVM. -Xiangrui On Wed, Feb 11, 2015 at 1:35 AM, lihu lihu...@gmail.com

Re: high GC in the Kmeans algorithm

2015-02-11 Thread Sean Owen
Good, worth double-checking that's what you got. That's barely 1GB per task though. Why run 48 if you have 24 cores? On Wed, Feb 11, 2015 at 9:03 AM, lihu lihu...@gmail.com wrote: I gave the executor 50GB, so it seems there is no reason for the memory to be insufficient. On Wed, Feb 11, 2015
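
The "barely 1GB per task" figure follows from dividing the executor memory by the number of concurrent tasks. A minimal sketch, assuming an even split and ignoring whatever share Spark reserves for cached data and other overhead (the function name is hypothetical):

```python
def memory_per_task_gb(executor_memory_gb: float, concurrent_tasks: int) -> float:
    """Even split of executor memory across concurrently running tasks."""
    return executor_memory_gb / concurrent_tasks


# Figures from the thread: 50GB executor, 48 tasks on a 24-core machine.
print(f"48 tasks: {memory_per_task_gb(50, 48):.2f} GB per task")
# Running one task per core instead doubles the share of each task.
print(f"24 tasks: {memory_per_task_gb(50, 24):.2f} GB per task")
```

Since the data is also cached in the same executor, the memory actually available to each task is likely smaller than this even split suggests.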

Re: high GC in the Kmeans algorithm

2015-02-11 Thread lihu
I just want to make the best use of the CPUs and to test how Spark performs when many tasks run on a single node. On Wed, Feb 11, 2015 at 5:29 PM, Sean Owen so...@cloudera.com wrote: Good, worth double-checking that's what you got. That's barely 1GB per task though. Why run 48 if you

high GC in the Kmeans algorithm

2015-02-11 Thread lihu
Hi, I am running k-means (MLlib) on a cluster with 12 workers. Every worker has 128GB of RAM and 24 cores. I run 48 tasks on one machine. The total data is just 40GB. When the dimension of the data set is about 10^7, each task takes about 30s, of which about 20s is spent in GC. When I
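
To put the reported numbers in proportion, the share of task time lost to GC can be computed directly. A small sketch using only the figures quoted above (the function name is hypothetical):

```python
def gc_fraction(task_seconds: float, gc_seconds: float) -> float:
    """Fraction of a task's duration that is spent in garbage collection."""
    return gc_seconds / task_seconds


# Figures from the message: about 30s per task, about 20s of it in GC.
fraction = gc_fraction(30, 20)
print(f"GC share of task time: {fraction:.0%}")
print(f"Time left for useful work: {30 - 20}s per task")
```

With roughly two thirds of each task spent collecting garbage, memory pressure rather than computation appears to dominate the run time.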