Re: high GC in the Kmeans algorithm

2015-02-20 Thread Xiangrui Meng
A single vector of size 10^7 won't hit that bound. How many clusters
did you set? The broadcast variable size is on the order of 10^7 * k,
so you can calculate the amount of memory it needs. Try reducing the
number of tasks and see whether that helps. -Xiangrui
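To make the arithmetic in the reply above concrete: assuming dense double-precision centers (8 bytes per value), the broadcast size can be estimated as follows. The function name and the k = 100 choice are illustrative, not from the thread.

```python
def broadcast_center_bytes(dim: int, k: int, bytes_per_value: int = 8) -> int:
    """Rough size of the k-means center broadcast: k dense centers,
    each an array of `dim` double-precision values."""
    return dim * k * bytes_per_value

# With 10^7-dimensional data and, say, k = 100 clusters:
size = broadcast_center_bytes(10**7, 100)
print(f"{size / 2**30:.2f} GiB")  # about 7.45 GiB just for the centers
```

So the dimension alone is harmless, but dimension times cluster count can dominate executor memory.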


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: high GC in the Kmeans algorithm

2015-02-17 Thread lihu
Thanks for your answer. Yes, I cached the data; I can observe from the
Web UI that all of it is in memory.

What worries me is the dimension, not the total size.

Sean Owen once answered me that broadcast supports a maximum array
size of 2GB, so isn't 10^7 a little large?
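For scale, a single dense vector of dimension 10^7 stored as doubles is nowhere near the ~2GB JVM array ceiling mentioned above. A back-of-the-envelope check (the numbers are illustrative, not measured from the job):

```python
dim = 10**7
vector_bytes = dim * 8            # one dense vector of doubles, 8 bytes each
limit_bytes = 2 * 2**30           # the ~2 GB array-size bound quoted above
print(f"{vector_bytes / 2**20:.1f} MiB")  # about 76.3 MiB
assert vector_bytes < limit_bytes
```

A single 10^7-dimensional vector is tens of megabytes, so the 2GB limit only becomes relevant when many centers are packed into one structure.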




Re: high GC in the Kmeans algorithm

2015-02-17 Thread Xiangrui Meng
Did you cache the data? Was it fully cached? The k-means
implementation doesn't create many temporary objects. I guess you need
more RAM to avoid frequent GC. Please monitor the memory
usage with YourKit or VisualVM. -Xiangrui
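One way to follow this advice without attaching a profiler is to enable the JVM's own GC logging on the executors. A sketch, roughly Spark 1.x-era flags; the class name, jar, and memory value are placeholders for your job, not from the thread:

```shell
# Hypothetical launch with executor GC logging enabled
spark-submit \
  --class com.example.KMeansJob \
  --executor-memory 50g \
  --conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  kmeans-job.jar
```

The GC lines then appear in each executor's stdout log, which you can correlate with the per-task GC time shown in the Web UI.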




Re: high GC in the Kmeans algorithm

2015-02-11 Thread Sean Owen
Good, worth double-checking that's what you got. That's barely 1GB per
task, though. Why run 48 tasks if you have 24 cores?
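The "barely 1GB per task" figure follows directly from the numbers in the thread: 50GB of executor memory split across 48 concurrent tasks.

```python
executor_memory_gb = 50   # memory given to the executor
concurrent_tasks = 48     # tasks running at once on the machine
per_task_gb = executor_memory_gb / concurrent_tasks
print(f"{per_task_gb:.2f} GB per task")  # about 1.04 GB
```

With that little headroom per task, any sizable per-task allocation pushes the heap toward its limit and the collector runs constantly.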




Re: high GC in the Kmeans algorithm

2015-02-11 Thread lihu
I just want to make the best use of the CPU, and to test Spark's performance
when there are many tasks on a single node.





--
Best Wishes!

Li Hu(李浒) | Graduate Student
Institute for Interdisciplinary Information Sciences (IIIS)
Tsinghua University, China

Email: lihu...@gmail.com
Homepage: http://iiis.tsinghua.edu.cn/zh/lihu/


high GC in the Kmeans algorithm

2015-02-11 Thread lihu
Hi,
    I run k-means (MLlib) on a cluster with 12 workers. Each worker has
128G RAM and 24 cores. I run 48 tasks on one machine; the total data is just
40GB.

    When the dimension of the data set is about 10^7, each task takes
about 30s, of which about 20s is spent in GC.

    When I reduce the dimension to 10^4, the GC time is small.

    So why is GC so high when the dimension is larger? Or is this
caused by MLlib?
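A plausible reading of the numbers in this question, sketched as back-of-the-envelope arithmetic; the assumption that each task holds a few dense double vectors as temporaries is illustrative, not measured from MLlib:

```python
BYTES_PER_DOUBLE = 8

def dense_point_bytes(dim: int) -> int:
    """Heap footprint of one dense double vector of the given dimension."""
    return dim * BYTES_PER_DOUBLE

high_dim = dense_point_bytes(10**7)   # 80,000,000 bytes, about 76 MiB per vector
low_dim = dense_point_bytes(10**4)    # 80,000 bytes, about 78 KiB per vector

# With roughly 1 GB of heap per task (50 GB / 48 tasks), a handful of
# ~76 MiB temporaries is enough to trigger frequent collections, while
# ~78 KiB temporaries barely register.
print(high_dim // low_dim)  # 1000x difference between the two regimes
```

This would explain why the same code shows heavy GC at 10^7 dimensions and almost none at 10^4, without any change in MLlib itself.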