Not sure if this is always ideal for Naive Bayes, but you could also hash the features into a lower-dimensional space (e.g., reduce them to 50,000 features). For each feature, simply take MurmurHash3(featureID) % 50000, for example.
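A minimal spark-shell-style sketch of that hashing trick in Scala, assuming the sparse vectors are given as parallel index/value arrays; the function name, the bucket count, hashing the index as a string, and summing colliding values are all assumptions here, not anything from MLlib itself:

    import scala.util.hashing.MurmurHash3
    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    val numBuckets = 50000  // target dimensionality after hashing

    // Map each original feature index into one of numBuckets slots,
    // summing the values of any indices that collide in the same bucket.
    def hashFeatures(indices: Array[Int], values: Array[Double]): Vector = {
      val buckets = scala.collection.mutable.HashMap.empty[Int, Double]
      for ((i, v) <- indices.zip(values)) {
        val h = MurmurHash3.stringHash(i.toString)
        val b = ((h % numBuckets) + numBuckets) % numBuckets  // keep the bucket non-negative
        buckets(b) = buckets.getOrElse(b, 0.0) + v
      }
      Vectors.sparse(numBuckets, buckets.toSeq.sortBy(_._1))
    }

Collisions lose a little information, but for count-style features that is usually an acceptable trade for shrinking the model by a factor of 40.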
Matei

On Apr 27, 2014, at 11:24 PM, DB Tsai <dbt...@stanford.edu> wrote:

> Our customer asked us to implement Naive Bayes which should be able to at
> least train news20 one year ago, and we implemented it for them in Hadoop
> using the distributed cache to store the model.
>
> Sincerely,
>
> DB Tsai
> -------------------------------------------------------
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
>
> On Sun, Apr 27, 2014 at 11:03 PM, Xiangrui Meng <men...@gmail.com> wrote:
> How big is your problem and how many labels? -Xiangrui
>
> On Sun, Apr 27, 2014 at 10:28 PM, DB Tsai <dbt...@stanford.edu> wrote:
> > Hi Xiangrui,
> >
> > We also ran into this issue at Alpine Data Labs. We ended up using an LRU
> > cache to store the counts, and splitting the least-used counts out to the
> > distributed cache in HDFS.
> >
> > Sincerely,
> >
> > DB Tsai
> > -------------------------------------------------------
> > My Blog: https://www.dbtsai.com
> > LinkedIn: https://www.linkedin.com/in/dbtsai
> >
> > On Sun, Apr 27, 2014 at 7:34 PM, Xiangrui Meng <men...@gmail.com> wrote:
> >> Even if the features are sparse, the conditional probabilities are stored
> >> in a dense matrix. With 200 labels and 2 million features, you need to
> >> store at least 4e8 doubles on the driver node. With multiple partitions,
> >> you may need more memory on the driver. Could you try reducing the number
> >> of partitions and giving the driver more RAM to see whether that helps?
> >> -Xiangrui
> >>
> >> On Sun, Apr 27, 2014 at 3:33 PM, John King <usedforprinting...@gmail.com> wrote:
> >> > I'm already using the SparseVector class.
> >> >
> >> > ~200 labels
> >> >
> >> > On Sun, Apr 27, 2014 at 12:26 AM, Xiangrui Meng <men...@gmail.com> wrote:
> >> >> How many labels does your dataset have? -Xiangrui
> >> >>
> >> >> On Sat, Apr 26, 2014 at 6:03 PM, DB Tsai <dbt...@stanford.edu> wrote:
> >> >> > Which version of MLlib are you using? For Spark 1.0, MLlib will
> >> >> > support sparse feature vectors, which will improve performance a lot
> >> >> > when computing the distance between points and the centroid.
> >> >> >
> >> >> > Sincerely,
> >> >> >
> >> >> > DB Tsai
> >> >> > -------------------------------------------------------
> >> >> > My Blog: https://www.dbtsai.com
> >> >> > LinkedIn: https://www.linkedin.com/in/dbtsai
> >> >> >
> >> >> > On Sat, Apr 26, 2014 at 5:49 AM, John King <usedforprinting...@gmail.com> wrote:
> >> >> >> I'm just wondering: are the SparkVector calculations really taking
> >> >> >> the sparsity into account, or just converting to dense?
> >> >> >>
> >> >> >> On Fri, Apr 25, 2014 at 10:06 PM, John King <usedforprinting...@gmail.com> wrote:
> >> >> >>> I've been trying to use the Naive Bayes classifier. Each example in
> >> >> >>> the dataset has about 2 million features, only about 20-50 of which
> >> >> >>> are non-zero, so the vectors are very sparse. I keep running out of
> >> >> >>> memory though, even for about 1000 examples on 30 GB of RAM, while
> >> >> >>> the entire dataset is 4 million examples. And I would also like to
> >> >> >>> note that I'm using the sparse vector class.
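For reference, a rough Scala sketch of the partition and driver-memory tuning Xiangrui suggests above, assuming `data` is an RDD[LabeledPoint] already built from the hashed vectors; the function name, the partition count of 8, and lambda = 1.0 are arbitrary placeholders:

    import org.apache.spark.mllib.classification.NaiveBayes
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD

    def trainWithSmallDriverFootprint(data: RDD[LabeledPoint]) = {
      // 200 labels x 2,000,000 raw features = 4e8 doubles, roughly 3.2 GB for a
      // single copy of the count matrix on the driver; hashed 50,000-feature
      // vectors cut that to about 80 MB. Fewer partitions also means fewer
      // partial aggregates arriving at the driver.
      val coalesced = data.coalesce(8)
      NaiveBayes.train(coalesced, lambda = 1.0)  // lambda is the additive-smoothing parameter
    }

The other half of the suggestion is simply giving the driver more memory, e.g. via spark-submit's --driver-memory flag.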