Re: Running out of memory Naive Bayes
Not sure if this is always ideal for Naive Bayes, but you could also hash
the features into a lower-dimensional space (e.g. reduce them to 50,000
features). For each feature, simply take MurmurHash3(featureID) % 50000,
for example.

Matei

On Apr 27, 2014, at 11:24 PM, DB Tsai wrote:

> A year ago, a customer asked us to implement a Naive Bayes that could at
> least train on news20, and we implemented it for them in Hadoop, using
> the distributed cache to store the model.
>
> Sincerely,
>
> DB Tsai
> ---
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
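A minimal sketch of the hashing trick Matei suggests, assuming we hash raw feature IDs into 50,000 buckets. MurmurHash3 is not in the Python standard library, so `zlib.crc32` stands in for it here; the feature IDs and values are illustrative.

```python
import zlib

NUM_BUCKETS = 50_000  # target dimensionality, per Matei's example

def hash_features(sparse_example):
    """Map a {featureID: value} dict into a fixed 50,000-bucket space.

    Colliding features simply have their values summed, which is the
    usual behavior of the hashing trick.
    """
    hashed = {}
    for feature_id, value in sparse_example.items():
        # zlib.crc32 is a stand-in for MurmurHash3(featureID)
        bucket = zlib.crc32(str(feature_id).encode()) % NUM_BUCKETS
        hashed[bucket] = hashed.get(bucket, 0.0) + value
    return hashed

# A 2M-dimensional example with 3 non-zeros collapses to <= 3 buckets
example = {1_234_567: 1.0, 42: 3.0, 1_999_999: 2.0}
hashed = hash_features(example)
```

This shrinks the dense conditional-probability matrix from 200 x 2M to 200 x 50k, at the cost of occasional collisions, which Naive Bayes usually tolerates well.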
Re: Running out of memory Naive Bayes
A year ago, a customer asked us to implement a Naive Bayes that could at
least train on news20, and we implemented it for them in Hadoop, using the
distributed cache to store the model.

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai

On Sun, Apr 27, 2014 at 11:03 PM, Xiangrui Meng wrote:

> How big is your problem and how many labels? -Xiangrui
Re: Running out of memory Naive Bayes
How big is your problem and how many labels? -Xiangrui

On Sun, Apr 27, 2014 at 10:28 PM, DB Tsai wrote:

> Hi Xiangrui,
>
> We also ran into this issue at Alpine Data Labs. We ended up using an LRU
> cache to store the counts, spilling the least-used counts to the
> distributed cache in HDFS.
>
> Sincerely,
>
> DB Tsai
> ---
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
Re: Running out of memory Naive Bayes
Hi Xiangrui,

We also ran into this issue at Alpine Data Labs. We ended up using an LRU
cache to store the counts, spilling the least-used counts to the
distributed cache in HDFS.

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai

On Sun, Apr 27, 2014 at 7:34 PM, Xiangrui Meng wrote:

> Even if the features are sparse, the conditional probabilities are stored
> in a dense matrix. With 200 labels and 2 million features, you need to
> store at least 4e8 doubles on the driver node. With multiple partitions,
> you may need more memory on the driver. Could you try reducing the number
> of partitions and giving the driver more RAM to see whether that helps?
> -Xiangrui
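A minimal sketch of the scheme DB Tsai describes: keep hot counts in a bounded in-memory LRU and spill the least-recently-used entries to slower storage. Alpine's actual implementation is not shown in the thread; a plain dict stands in for the HDFS distributed cache, and the capacity is illustrative.

```python
from collections import OrderedDict

class SpillingCountStore:
    """Bounded LRU of (label, feature) counts with a cold spill tier."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.hot = OrderedDict()   # in-memory LRU, insertion order = recency
        self.spilled = {}          # stand-in for the HDFS distributed cache

    def increment(self, key, amount=1):
        # pull the current count out of whichever tier holds it
        count = self.hot.pop(key, None)
        if count is None:
            count = self.spilled.pop(key, 0)
        self.hot[key] = count + amount   # (re)insert as most recently used
        while len(self.hot) > self.capacity:
            # evict the least-recently-used count to the spill store
            cold_key, cold_count = self.hot.popitem(last=False)
            self.spilled[cold_key] = cold_count

    def get(self, key):
        if key in self.hot:
            self.hot.move_to_end(key)    # refresh recency on read
            return self.hot[key]
        return self.spilled.get(key, 0)

store = SpillingCountStore(capacity=2)
for key in ["a", "b", "a", "c", "a"]:
    store.increment(key)
# "b" was least recently used, so it ends up in the spill tier
```

The win is that the driver only ever holds `capacity` counts in memory, regardless of the number of label-feature pairs.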
Re: Running out of memory Naive Bayes
Even if the features are sparse, the conditional probabilities are stored
in a dense matrix. With 200 labels and 2 million features, you need to
store at least 4e8 doubles on the driver node. With multiple partitions,
you may need more memory on the driver. Could you try reducing the number
of partitions and giving the driver more RAM to see whether that helps?
-Xiangrui

On Sun, Apr 27, 2014 at 3:33 PM, John King wrote:

> I'm already using the SparseVector class.
>
> ~200 labels
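Xiangrui's 4e8 figure is easy to sanity-check; the sketch below just multiplies it out, assuming 8 bytes per double and ignoring any JVM object overhead.

```python
# Back-of-envelope check of the dense conditional-probability matrix:
# 200 labels x 2,000,000 features, stored as 8-byte doubles.
labels = 200
features = 2_000_000
doubles = labels * features        # 4e8 entries
bytes_needed = doubles * 8         # 8 bytes per IEEE-754 double
gib = bytes_needed / 2**30         # ~3.0 GiB for the matrix alone
```

So the model matrix alone is about 3 GiB, and any per-partition copies or aggregation buffers on the driver multiply that, which is why reducing the number of partitions helps.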
Re: Running out of memory Naive Bayes
I'm already using the SparseVector class.

~200 labels

On Sun, Apr 27, 2014 at 12:26 AM, Xiangrui Meng wrote:

> How many labels does your dataset have? -Xiangrui
Re: Running out of memory Naive Bayes
How many labels does your dataset have? -Xiangrui

On Sat, Apr 26, 2014 at 6:03 PM, DB Tsai wrote:

> Which version of MLlib are you using? For Spark 1.0, MLlib will support
> sparse feature vectors, which will improve performance a lot when
> computing the distance between points and a centroid.
>
> Sincerely,
>
> DB Tsai
> ---
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
Re: Running out of memory Naive Bayes
Which version of MLlib are you using? For Spark 1.0, MLlib will support
sparse feature vectors, which will improve performance a lot when computing
the distance between points and a centroid.

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai

On Sat, Apr 26, 2014 at 5:49 AM, John King wrote:

> I'm just wondering: are the SparseVector calculations really taking the
> sparsity into account, or just converting to dense?
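To see why sparsity-aware distance matters, here is a sketch of the standard trick: with the centroid's squared norm precomputed, the squared distance to a sparse point only touches the point's non-zeros (20-50 entries) instead of all 2 million dimensions. This is plain Python with a vector as an (indices, values) pair, not the MLlib SparseVector API.

```python
import math

def sq_dist_sparse(indices, values, centroid, centroid_sq_norm):
    """||x - c||^2 = ||c||^2 + sum over non-zero i of ((x_i - c_i)^2 - c_i^2).

    Zero entries of x contribute c_i^2, which is already inside ||c||^2,
    so only the non-zeros need a correction term.
    """
    dist = centroid_sq_norm
    for i, v in zip(indices, values):
        dist += (v - centroid[i]) ** 2 - centroid[i] ** 2
    return dist

centroid = [0.5, 0.0, 1.0, 2.0]
sq_norm = sum(c * c for c in centroid)

# sparse point x = [0, 0, 3.0, 1.0] stored as indices + values
d = sq_dist_sparse([2, 3], [3.0, 1.0], centroid, sq_norm)

# dense reference computation over every dimension
dense = sum((x - c) ** 2 for x, c in zip([0.0, 0.0, 3.0, 1.0], centroid))
assert math.isclose(d, dense)
```

With 20-50 non-zeros out of 2 million dimensions, the sparse path does roughly 40,000x fewer multiply-adds per distance.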
Re: Running out of memory Naive Bayes
I'm just wondering: are the SparseVector calculations really taking the
sparsity into account, or just converting to dense?

On Fri, Apr 25, 2014 at 10:06 PM, John King wrote:

> I've been trying to use the Naive Bayes classifier. Each example in the
> dataset has about 2 million features, only about 20-50 of which are
> non-zero, so the vectors are very sparse. I keep running out of memory
> though, even for about 1000 examples on 30 GB of RAM, while the entire
> dataset is 4 million examples. I would also like to note that I'm using
> the sparse vector class.