Re: high GC in the Kmeans algorithm
A single vector of size 10^7 won't hit that bound. How many clusters did you set? The broadcast variable size is 10^7 * k, so you can calculate how much memory it needs. Try reducing the number of tasks and see whether that helps.

-Xiangrui

On Tue, Feb 17, 2015 at 7:20 PM, lihu lihu...@gmail.com wrote: [...]

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
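Xiangrui's "10^7 * k" point can be sanity-checked with simple arithmetic. This is a back-of-the-envelope sketch (my own, not MLlib code), assuming the broadcast holds k cluster centers as dense vectors of 8-byte doubles:

```python
# Back-of-the-envelope sketch (not MLlib code): estimate the size of the
# broadcast cluster centers in k-means, assuming dense double vectors.
def center_broadcast_bytes(dim: int, k: int) -> int:
    """k centers, each a dense vector of `dim` 8-byte doubles."""
    return dim * k * 8

# With dim = 10^7, even a modest k makes the broadcast large:
#   k = 10  -> 0.8 GB
#   k = 100 -> 8.0 GB
for k in (10, 100):
    print(k, center_broadcast_bytes(10**7, k) / 1e9, "GB")
```

So even though the input vectors are only 10^7 long, the broadcast grows linearly with the number of clusters, which is why k matters here.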
Re: high GC in the Kmeans algorithm
Thanks for your answer. Yes, I cached the data; I can observe from the WebUI that all the data is cached in memory. What worries me is the dimension, not the total size. Sean Owen once answered me that broadcast supports a maximum array size of 2GB, so is 10^7 a little too large?

On Wed, Feb 18, 2015 at 5:43 AM, Xiangrui Meng men...@gmail.com wrote: [...]

--
Best Wishes!

Li Hu(李浒) | Graduate Student
Institute for Interdisciplinary Information Sciences (IIIS)
Tsinghua University, China

Email: lihu...@gmail.com
Homepage: http://iiis.tsinghua.edu.cn/zh/lihu/
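The 2GB worry can also be checked numerically. A quick sketch, assuming a single dense vector of 10^7 doubles (the exact serialized bound is an assumption on my part; the point is the order of magnitude):

```python
# Sketch: compare one dense vector of 10^7 doubles against a 2 GB bound.
DOUBLE_BYTES = 8
vector_bytes = 10**7 * DOUBLE_BYTES   # 80,000,000 bytes (~80 MB)
two_gb = 2 * 1024**3

print(vector_bytes / 1e6, "MB")
assert vector_bytes < two_gb          # comfortably under the 2 GB limit
```

A single 10^7-element vector is only about 80 MB, far below 2GB, which matches Xiangrui's reply that one vector won't hit that bound; it is the factor of k in the broadcast centers that can.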
Re: high GC in the Kmeans algorithm
Did you cache the data? Was it fully cached? The k-means implementation doesn't create many temporary objects. I guess you need more RAM to avoid GC being triggered frequently. Please monitor the memory usage using YourKit or VisualVM.

-Xiangrui

On Wed, Feb 11, 2015 at 1:35 AM, lihu lihu...@gmail.com wrote: [...]
Re: high GC in the Kmeans algorithm
Good, worth double-checking that's what you got. That's barely 1GB per task, though. Why run 48 if you have 24 cores?

On Wed, Feb 11, 2015 at 9:03 AM, lihu lihu...@gmail.com wrote: I give 50GB to the executor, so it seems there is no reason the memory is not enough.

On Wed, Feb 11, 2015 at 4:50 PM, Sean Owen so...@cloudera.com wrote: Meaning, you have 128GB per machine, but how much memory are you giving the executors?

On Wed, Feb 11, 2015 at 8:49 AM, lihu lihu...@gmail.com wrote: What do you mean? Yes, I can see there is some data put in memory from the web UI.

On Wed, Feb 11, 2015 at 4:25 PM, Sean Owen so...@cloudera.com wrote: Are you actually using that memory for executors?

On Wed, Feb 11, 2015 at 8:17 AM, lihu lihu...@gmail.com wrote: [...]
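Sean's "barely 1GB per task" follows directly from the figures stated in this exchange (a quick sketch using only those numbers):

```python
# Sketch: memory available per concurrent task, from the figures in this thread.
executor_memory_gb = 50   # heap the executor was given
concurrent_tasks = 48     # tasks run at once on the machine

per_task_gb = executor_memory_gb / concurrent_tasks
print(round(per_task_gb, 2), "GB per task")  # ~1.04 GB
```

With 10^7-dimensional vectors in flight per task, roughly 1GB of headroom leaves little slack before the collector starts running frequently.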
Re: high GC in the Kmeans algorithm
I just want to make the best use of the CPU, and to test the performance of Spark when there are a lot of tasks on a single node.

On Wed, Feb 11, 2015 at 5:29 PM, Sean Owen so...@cloudera.com wrote: [...]
high GC in the Kmeans algorithm
Hi, I run the KMeans (MLlib) in a cluster with 12 workers. Each worker has 128GB RAM and 24 cores, and I run 48 tasks on one machine. The total data is just 40GB. When the dimension of the data set is about 10^7, each task takes about 30s, of which about 20s is spent on GC. When I reduce the dimension to 10^4, the GC time is small. So why is GC so high when the dimension is larger? Or is this caused by MLlib?
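The severity of the problem can be put in a single number from the timings reported above (a sketch using only the figures in this message):

```python
# Sketch: GC overhead fraction from the timings reported in the question.
task_duration_s = 30
gc_time_s = 20

gc_fraction = gc_time_s / task_duration_s
print(f"GC is {gc_fraction:.0%} of task time")
```

Roughly two-thirds of each task's wall time is spent in garbage collection, which is why the rest of the thread focuses on per-task memory headroom rather than on the algorithm itself.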