Re: Running out of memory Naive Bayes

2014-04-28 Thread DB Tsai
Our customer asked us to implement Naive Bayes which should be able to at least train news20 one year ago, and we implemented for them in Hadoop using distributed cache to store the model. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com

Re: Running out of memory Naive Bayes

2014-04-28 Thread Matei Zaharia
Not sure if this is always ideal for Naive Bayes, but you could also hash the features into a lower-dimensional space (e.g. reduce it to 50,000 features). For each feature simply take MurmurHash3(featureID) % 5 for example. Matei On Apr 27, 2014, at 11:24 PM, DB Tsai dbt...@stanford.edu

Re: Running out of memory Naive Bayes

2014-04-26 Thread John King
I'm just wondering are the SparkVector calculations really taking into account the sparsity or just converting to dense? On Fri, Apr 25, 2014 at 10:06 PM, John King usedforprinting...@gmail.comwrote: I've been trying to use the Naive Bayes classifier. Each example in the dataset is about 2

Re: Running out of memory Naive Bayes

2014-04-26 Thread DB Tsai
Which version of mllib are you using? For Spark 1.0, mllib will support sparse feature vector which will improve performance a lot when computing the distance between points and centroid. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com

Re: Running out of memory Naive Bayes

2014-04-26 Thread Xiangrui Meng
How many labels does your dataset have? -Xiangrui On Sat, Apr 26, 2014 at 6:03 PM, DB Tsai dbt...@stanford.edu wrote: Which version of mllib are you using? For Spark 1.0, mllib will support sparse feature vector which will improve performance a lot when computing the distance between points

Running out of memory Naive Bayes

2014-04-25 Thread John King
I've been trying to use the Naive Bayes classifier. Each example in the dataset is about 2 million features, only about 20-50 of which are non-zero, so the vectors are very sparse. I keep running out of memory though, even for about 1000 examples on 30gb RAM while the entire dataset is 4 million