Thanks for your answer. Yes, I cached the data; I can observe from the WebUI that all the data is cached in memory.
What I worry about is the dimension, not the total size. Sean Owen once answered me that Broadcast supports a maximum array size of 2GB, so 10^7 may be a little large?

On Wed, Feb 18, 2015 at 5:43 AM, Xiangrui Meng <men...@gmail.com> wrote:
> Did you cache the data? Was it fully cached? The k-means implementation
> doesn't create many temporary objects. I guess you need more RAM to avoid
> GC being triggered frequently. Please monitor the memory usage using
> YourKit or VisualVM. -Xiangrui
>
> On Wed, Feb 11, 2015 at 1:35 AM, lihu <lihu...@gmail.com> wrote:
> > I just want to make the best use of the CPU, and to test the performance
> > of Spark when there are many tasks on a single node.
> >
> > On Wed, Feb 11, 2015 at 5:29 PM, Sean Owen <so...@cloudera.com> wrote:
> >> Good, worth double-checking that's what you got. That's barely 1GB per
> >> task though. Why run 48 if you have 24 cores?
> >>
> >> On Wed, Feb 11, 2015 at 9:03 AM, lihu <lihu...@gmail.com> wrote:
> >> > I give 50GB to the executor, so it seems there is no reason the
> >> > memory is not enough.
> >> >
> >> > On Wed, Feb 11, 2015 at 4:50 PM, Sean Owen <so...@cloudera.com> wrote:
> >> >> Meaning, you have 128GB per machine, but how much memory are you
> >> >> giving the executors?
> >> >>
> >> >> On Wed, Feb 11, 2015 at 8:49 AM, lihu <lihu...@gmail.com> wrote:
> >> >> > What do you mean? Yes, I can see there is some data put in memory
> >> >> > from the web UI.
> >> >> >
> >> >> > On Wed, Feb 11, 2015 at 4:25 PM, Sean Owen <so...@cloudera.com> wrote:
> >> >> >> Are you actually using that memory for executors?
> >> >> >>
> >> >> >> On Wed, Feb 11, 2015 at 8:17 AM, lihu <lihu...@gmail.com> wrote:
> >> >> >> > Hi,
> >> >> >> >     I run the kmeans (MLlib) in a cluster with 12 workers. Every
> >> >> >> > worker owns 128G RAM and 24 cores. I run 48 tasks on one
> >> >> >> > machine; the total data is just 40GB.
> >> >> >> >
> >> >> >> >     When the dimension of the data set is about 10^7, for every
> >> >> >> > task the duration is about 30s, but the cost for GC is about 20s.
> >> >> >> >
> >> >> >> >     When I reduce the dimension to 10^4, the GC cost is small.
> >> >> >> >
> >> >> >> >     So why is GC so high when the dimension is larger? Or is
> >> >> >> > this caused by MLlib?
> >> >> >
> >> >> > --
> >> >> > Best Wishes!
> >> >> >
> >> >> > Li Hu(李浒) | Graduate Student
> >> >> > Institute for Interdisciplinary Information Sciences (IIIS)
> >> >> > Tsinghua University, China
> >> >> >
> >> >> > Email: lihu...@gmail.com
> >> >> > Homepage: http://iiis.tsinghua.edu.cn/zh/lihu/
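To put the 2GB concern in context, here is a rough back-of-the-envelope sketch (not from the thread): k-means keeps k cluster centers, and with dense double-precision vectors each center costs dimension × 8 bytes. The values of k below are hypothetical examples, not something the poster stated.

```python
# Assumed arithmetic sketch: size of k dense k-means centers at a given
# dimension, each component an 8-byte double. The k values are hypothetical.

def centers_bytes(k: int, dim: int, bytes_per_double: int = 8) -> int:
    """Approximate in-memory payload of k dense centers of dimension dim."""
    return k * dim * bytes_per_double

dim = 10**7  # the dimension from the thread
for k in (2, 10, 100):
    gib = centers_bytes(k, dim) / 2**30
    print(f"k={k}: ~{gib:.2f} GiB")
```

At dim = 10^7, even a modest k pushes the centers into the gigabyte range (k=100 is already ~7.5 GiB), which would exceed a 2GB array limit and, independently, would create large long-lived allocations per task that plausibly explain heavy GC compared to the 10^4 case.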