Thanks for your answer. Yes, I cached the data; I can see from the
WebUI that all of it is cached in memory.

What worries me is the dimension, not the total size.

Sean Owen once told me that broadcast supports a maximum array size of
2GB, so is 10^7 perhaps too large?
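For a rough sense of scale, here is a back-of-envelope sketch (the cluster count k = 500 is purely hypothetical; substitute whatever k was passed to KMeans):

```python
# Back-of-envelope estimate of the memory held by dense k-means centers,
# which are shipped to every task each iteration.
# k = 500 is a hypothetical cluster count; adjust to your KMeans setting.

def centers_bytes(k: int, d: int) -> int:
    """Bytes used by k dense centers of dimension d (8-byte doubles)."""
    return k * d * 8

d = 10_000_000            # 10^7-dimensional vectors, as in the question
one_vector_mb = centers_bytes(1, d) / 2**20
all_centers_gb = centers_bytes(500, d) / 2**30
print(f"one vector: {one_vector_mb:.1f} MB")     # ~76.3 MB
print(f"500 centers: {all_centers_gb:.1f} GB")   # ~37.3 GB
```

Note that a single dense 10^7-element double array is only about 80 MB, well under the JVM's 2GB single-array limit; but even when no one array hits that limit, allocating and discarding many ~80 MB arrays per iteration is exactly the kind of churn that can drive long GC pauses.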

On Wed, Feb 18, 2015 at 5:43 AM, Xiangrui Meng <men...@gmail.com> wrote:

> Did you cache the data? Was it fully cached? The k-means
> implementation doesn't create many temporary objects. I guess you need
> more RAM to avoid GC being triggered so frequently. Please monitor the
> memory usage using YourKit or VisualVM. -Xiangrui
>
> On Wed, Feb 11, 2015 at 1:35 AM, lihu <lihu...@gmail.com> wrote:
> > I just want to make the best use of the CPUs, and to test Spark's
> > performance when there are many tasks on a single node.
> >
> > On Wed, Feb 11, 2015 at 5:29 PM, Sean Owen <so...@cloudera.com> wrote:
> >>
> >> Good, worth double-checking that's what you got. That's barely 1GB per
> >> task though. Why run 48 if you have 24 cores?
> >>
> >> On Wed, Feb 11, 2015 at 9:03 AM, lihu <lihu...@gmail.com> wrote:
> >> > I give 50GB to each executor, so it seems the memory should be
> >> > enough.
> >> >
> >> > On Wed, Feb 11, 2015 at 4:50 PM, Sean Owen <so...@cloudera.com>
> >> > wrote:
> >> >>
> >> >> Meaning, you have 128GB per machine, but how much memory are you
> >> >> giving the executors?
> >> >>
> >> >> On Wed, Feb 11, 2015 at 8:49 AM, lihu <lihu...@gmail.com> wrote:
> >> >> > What do you mean? Yes, I can see from the web UI that some
> >> >> > data is in memory.
> >> >> >
> >> >> > On Wed, Feb 11, 2015 at 4:25 PM, Sean Owen <so...@cloudera.com>
> >> >> > wrote:
> >> >> >>
> >> >> >> Are you actually using that memory for executors?
> >> >> >>
> >> >> >> On Wed, Feb 11, 2015 at 8:17 AM, lihu <lihu...@gmail.com> wrote:
> >> >> >> > Hi,
> >> >> >> >     I run k-means (MLlib) on a cluster with 12 workers.
> >> >> >> > Each worker has 128GB RAM and 24 cores, and I run 48 tasks
> >> >> >> > on one machine. The total data is just 40GB.
> >> >> >> >
> >> >> >> >    When the dimension of the data set is about 10^7, each
> >> >> >> > task takes about 30s, of which about 20s is spent on GC.
> >> >> >> >
> >> >> >> >    When I reduce the dimension to 10^4, the GC time is small.
> >> >> >> >
> >> >> >> >     So why is GC so high when the dimension is larger? Or is
> >> >> >> > this caused by MLlib?
> >> >> >> >
> >> >> >> >
> >> >> >> >
> >> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> > --
> >> >> > Best Wishes!
> >> >> >
> >> >> > Li Hu(李浒) | Graduate Student
> >> >> > Institute for Interdisciplinary Information Sciences(IIIS)
> >> >> > Tsinghua University, China
> >> >> >
> >> >> > Email: lihu...@gmail.com
> >> >> > Homepage: http://iiis.tsinghua.edu.cn/zh/lihu/
> >> >> >
> >> >> >
> >> >
> >> >
> >> >
> >> >
> >
> >
> >
> >
>
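As a concrete way to do the GC monitoring Xiangrui suggests without attaching YourKit or VisualVM, one option is to enable GC logging on the executors. This is only a sketch: the job file name is a placeholder, and the flags shown are the standard HotSpot GC-logging options for Java 7/8.

```shell
# Turn on GC logging in the executors; the output appears in each
# executor's stderr, viewable from the WebUI. Flags are HotSpot (Java 7/8).
spark-submit \
  --conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  --conf spark.executor.memory=50g \
  your_kmeans_job.jar
```

The GC log then shows whether the 20s is many young-generation collections (allocation churn from the large vectors) or long full GCs (the heap is genuinely too small).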
