Hi Xiangrui,

We also ran into this issue at Alpine Data Labs. We ended up using an LRU
cache to store the counts, spilling the least-recently-used counts out to a
distributed cache in HDFS.
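
Roughly, the counting side looks like the sketch below (Scala, simplified;
the class name and the spill hook are just illustrative, not our actual
code, and merging the spilled counts back into the final model is left
out):

  import java.util.{LinkedHashMap => JLinkedHashMap}
  import java.util.Map.Entry

  // (label, feature) -> count, with LRU eviction handing the evicted
  // entry to a spill function (e.g. an HDFS-backed store).
  class SpillingCounts(capacity: Int, spill: ((Int, Int), Double) => Unit) {
    private val lru =
      new JLinkedHashMap[(Int, Int), Double](capacity, 0.75f, true) {
        override def removeEldestEntry(e: Entry[(Int, Int), Double]): Boolean = {
          val evict = size() > capacity
          if (evict) spill(e.getKey, e.getValue)  // push the coldest count out
          evict
        }
      }

    def add(key: (Int, Int), delta: Double): Unit = {
      val current = if (lru.containsKey(key)) lru.get(key) else 0.0
      lru.put(key, current + delta)
    }
  }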


Sincerely,

DB Tsai
-------------------------------------------------------
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Sun, Apr 27, 2014 at 7:34 PM, Xiangrui Meng <men...@gmail.com> wrote:

> Even though the features are sparse, the conditional probabilities are
> stored in a dense matrix. With 200 labels and 2 million features, you need
> to store at least 4e8 doubles on the driver node. With multiple
> partitions, you may need more memory on the driver. Could you try reducing
> the number of partitions and giving the driver more RAM to see whether
> that helps? -Xiangrui
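>
> (Back-of-the-envelope: 200 labels * 2,000,000 features = 4e8 entries, and
> at 8 bytes per double that is roughly 3.2 GB for the dense matrix alone,
> before any per-partition copies during aggregation.)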
>
> On Sun, Apr 27, 2014 at 3:33 PM, John King <usedforprinting...@gmail.com>
> wrote:
> > I'm already using the SparseVector class.
> >
> > ~200 labels
> >
> >
> > On Sun, Apr 27, 2014 at 12:26 AM, Xiangrui Meng <men...@gmail.com>
> wrote:
> >>
> >> How many labels does your dataset have? -Xiangrui
> >>
> >> On Sat, Apr 26, 2014 at 6:03 PM, DB Tsai <dbt...@stanford.edu> wrote:
> >> > Which version of MLlib are you using? For Spark 1.0, MLlib will
> >> > support sparse feature vectors, which improves performance a lot
> >> > when computing the distance between points and centroids.
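> >> >
> >> > As a rough illustration of the 1.0 API (names as of the current
> >> > master, so they could still change):
> >> >
> >> >   import org.apache.spark.mllib.linalg.Vectors
> >> >   import org.apache.spark.mllib.regression.LabeledPoint
> >> >
> >> >   // 2M-dimensional vector storing only the handful of non-zero entries
> >> >   val features = Vectors.sparse(2000000, Array(3, 101, 70532),
> >> >                                 Array(1.0, 2.0, 1.0))
> >> >   val example = LabeledPoint(1.0, features)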
> >> >
> >> > Sincerely,
> >> >
> >> > DB Tsai
> >> > -------------------------------------------------------
> >> > My Blog: https://www.dbtsai.com
> >> > LinkedIn: https://www.linkedin.com/in/dbtsai
> >> >
> >> >
> >> > On Sat, Apr 26, 2014 at 5:49 AM, John King
> >> > <usedforprinting...@gmail.com> wrote:
> >> >> I'm just wondering: do the SparseVector calculations really take
> >> >> the sparsity into account, or do they just convert to dense?
> >> >>
> >> >>
> >> >> On Fri, Apr 25, 2014 at 10:06 PM, John King
> >> >> <usedforprinting...@gmail.com>
> >> >> wrote:
> >> >>>
> >> >>> I've been trying to use the Naive Bayes classifier. Each example
> >> >>> in the dataset has about 2 million features, only about 20-50 of
> >> >>> which are non-zero, so the vectors are very sparse. I keep running
> >> >>> out of memory, though, even for about 1,000 examples on 30 GB of
> >> >>> RAM, while the entire dataset is 4 million examples. I would also
> >> >>> like to note that I'm using the sparse vector class.
> >> >>
> >> >>
> >
> >
>
