How big is your problem and how many labels? -Xiangrui
On Sun, Apr 27, 2014 at 10:28 PM, DB Tsai <dbt...@stanford.edu> wrote:
> Hi Xiangrui,
>
> We also ran into this issue at Alpine Data Labs. We ended up using an LRU
> cache to store the counts, and spilling the least-used counts to the
> distributed cache in HDFS.
>
> Sincerely,
>
> DB Tsai
> -------------------------------------------------------
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
>
> On Sun, Apr 27, 2014 at 7:34 PM, Xiangrui Meng <men...@gmail.com> wrote:
>>
>> Even if the features are sparse, the conditional probabilities are stored
>> in a dense matrix. With 200 labels and 2 million features, you need to
>> store at least 4e8 doubles on the driver node. With multiple
>> partitions, you may need even more memory on the driver. Could you try
>> reducing the number of partitions and giving the driver more RAM to see
>> whether that helps? -Xiangrui
>>
>> On Sun, Apr 27, 2014 at 3:33 PM, John King <usedforprinting...@gmail.com>
>> wrote:
>> > I'm already using the SparseVector class.
>> >
>> > ~200 labels
>> >
>> > On Sun, Apr 27, 2014 at 12:26 AM, Xiangrui Meng <men...@gmail.com>
>> > wrote:
>> >>
>> >> How many labels does your dataset have? -Xiangrui
>> >>
>> >> On Sat, Apr 26, 2014 at 6:03 PM, DB Tsai <dbt...@stanford.edu> wrote:
>> >> > Which version of MLlib are you using? For Spark 1.0, MLlib will
>> >> > support sparse feature vectors, which will improve performance a lot
>> >> > when computing the distance between points and centroids.
>> >> >
>> >> > Sincerely,
>> >> >
>> >> > DB Tsai
>> >> > -------------------------------------------------------
>> >> > My Blog: https://www.dbtsai.com
>> >> > LinkedIn: https://www.linkedin.com/in/dbtsai
>> >> >
>> >> > On Sat, Apr 26, 2014 at 5:49 AM, John King
>> >> > <usedforprinting...@gmail.com> wrote:
>> >> >> I'm just wondering: are the SparseVector calculations really taking
>> >> >> into account the sparsity, or just converting to dense?
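[Editor's note: a quick back-of-the-envelope check of the dense matrix size Xiangrui describes above. This is a sketch in plain Python, not Spark code; the 8 bytes per double is the JVM primitive size and ignores array and object overhead, so the real footprint is somewhat larger.]

```python
# Estimate the memory for Naive Bayes' dense conditional-probability
# matrix: one double per (label, feature) pair.
num_labels = 200
num_features = 2_000_000
bytes_per_double = 8  # JVM primitive double, ignoring overhead

entries = num_labels * num_features       # total matrix entries
gib = entries * bytes_per_double / 2**30  # raw payload in GiB

print(entries)        # 400000000 (the 4e8 doubles mentioned above)
print(round(gib, 2))  # 2.98
```

And since each partition's partial counts are aggregated on the driver, several copies of a matrix this size can be alive at once, which is why fewer partitions and more driver RAM are suggested.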
>> >> >>
>> >> >> On Fri, Apr 25, 2014 at 10:06 PM, John King
>> >> >> <usedforprinting...@gmail.com> wrote:
>> >> >>>
>> >> >>> I've been trying to use the Naive Bayes classifier. Each example in
>> >> >>> the dataset has about 2 million features, only about 20-50 of which
>> >> >>> are non-zero, so the vectors are very sparse. I keep running out of
>> >> >>> memory, though, even for about 1000 examples on 30 GB of RAM, while
>> >> >>> the entire dataset is 4 million examples. I would also like to note
>> >> >>> that I'm using the sparse vector class.
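[Editor's note: to illustrate why vectors like John's are worth keeping sparse, here is a minimal pure-Python sketch of the parallel (indices, values) layout that MLlib's SparseVector uses. With ~50 non-zeros out of 2 million slots, operations touch only the stored entries. This is an illustration, not the actual MLlib implementation.]

```python
# Sparse vector stored as parallel (indices, values) arrays:
# only non-zero entries are kept, as in MLlib's SparseVector.
class SparseVec:
    def __init__(self, size, indices, values):
        self.size = size              # logical dimension (e.g. 2 million)
        self.indices = list(indices)  # positions of non-zero entries
        self.values = list(values)    # the non-zero entries themselves

    def dot(self, other):
        # Iterate only over non-zeros: O(nnz), not O(size).
        lookup = dict(zip(other.indices, other.values))
        return sum(v * lookup.get(i, 0.0)
                   for i, v in zip(self.indices, self.values))

a = SparseVec(2_000_000, [3, 10, 999_999], [1.0, 2.0, 3.0])
b = SparseVec(2_000_000, [10, 999_999], [4.0, 5.0])
print(a.dot(b))  # 23.0 = 2*4 + 3*5; the 2 million zeros cost nothing
```

Note that this only helps on the worker side; as discussed above, the model's conditional-probability matrix on the driver is still dense, so sparse inputs alone do not fix the driver's memory pressure.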