A single vector of size 10^7 won't hit that bound. How many clusters
did you set? The broadcast variable (the cluster centers) has 10^7 * k
elements, so you can calculate how much memory it needs. Try reducing
the number of tasks and see whether that helps. -Xiangrui
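A back-of-the-envelope sketch of that calculation; the value of k below
is a hypothetical example, and 8 bytes per Double ignores JVM object
overhead:

    // Size of the broadcast cluster centers: k dense vectors of
    // dimension 10^7, at 8 bytes per Double.
    val dim = 10000000L
    val k = 500L                                // hypothetical cluster count
    val gib = dim * k * 8L / math.pow(1024, 3)
    println(f"centers ~ $gib%.1f GiB")          // ~37.3 GiB for k = 500
    // Even k = 25 is already ~1.9 GiB, right at the 2GB block limit.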
On Tue, Feb 17, 2015 at 7:20 PM, lihu wrote:
Thanks for your answer. Yes, I cached the data; I can observe from the
WebUI that all of it is in memory.
What worries me is the dimension, not the total size.
Sean Owen once answered me that the maximum array size a broadcast
supports is 2GB, so isn't a dimension of 10^7 a little large?
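For scale, a sketch of what a single dense vector of that dimension
costs, assuming 8-byte doubles:

    // One dense vector of dimension 10^7:
    val vecBytes = 10000000L * 8L               // 80,000,000 bytes
    println(vecBytes / (1024.0 * 1024.0))       // ~76.3 MiB, far below 2GB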
Did you cache the data? Was it fully cached? The k-means
implementation doesn't create many temporary objects. I guess you need
more RAM to keep GC from triggering so frequently. Please monitor the
memory usage using YourKit or VisualVM. -Xiangrui
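A lighter-weight first step, sketched here with standard HotSpot flags
(generic JVM options, not something prescribed in this thread): turn on
GC logging in the executors and read the pause times from the logs.

    import org.apache.spark.SparkConf

    // Surface executor GC activity in the executor logs before (or
    // instead of) attaching YourKit/VisualVM to a running JVM.
    val conf = new SparkConf()
      .set("spark.executor.extraJavaOptions",
        "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")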
On Wed, Feb 11, 2015 at 1:35 AM, lihu lihu...@gmail.com wrote:
Good, worth double-checking that's what you got. That's barely 1GB per
task though. Why run 48 if you have 24 cores?
On Wed, Feb 11, 2015 at 9:03 AM, lihu lihu...@gmail.com wrote:
I gave 50GB to the executor, so it seems there should be no reason for
memory to run short.
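The arithmetic behind Sean's "barely 1GB per task", using the numbers
from this thread:

    // 50GB of executor memory shared by 48 concurrent tasks:
    val mbPerTask = 50.0 * 1024 / 48            // ~1067 MB per task
    // Spark reserves part of that for cached RDD blocks, so the budget
    // actually available to each task is smaller still.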
I just want to make the best use of the CPU, and to test Spark's
performance when there are a lot of tasks on a single node.
Hi,
I run k-means (MLlib) on a cluster with 12 workers. Each worker has
128GB of RAM and 24 cores, and I run 48 tasks on each machine. The
total data is just 40GB.
When the dimension of the data set is about 10^7, each task takes about
30s, of which about 20s is spent in GC.
When I
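For reference, a minimal sketch of the kind of run described above; the
input path and k are hypothetical stand-ins, not values from the thread:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    val sc = new SparkContext(new SparkConf().setAppName("kmeans-highdim"))
    // Parse one dense, space-separated point per line.
    val data = sc.textFile("hdfs:///path/to/data")  // hypothetical path
      .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
      .cache()                                      // cache before training
    val model = KMeans.train(data, 100, 10)         // k = 100, 10 iterations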