Our customer asked us a year ago to implement a Naive Bayes that could at
least train on news20, and we implemented it for them in Hadoop,
using the distributed cache to store the model.
Sincerely,
DB Tsai
---
My Blog: https://www.dbtsai.com
Not sure if this is always ideal for Naive Bayes, but you could also hash the
features into a lower-dimensional space (e.g. reduce it to 50,000 features).
For each feature, simply take MurmurHash3(featureID) % 50,000, for example.
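A minimal sketch of that hashing trick in plain Python. The stdlib's CRC32 stands in for MurmurHash3 here to stay dependency-free (in practice you would use e.g. the mmh3 package), and the 50,000-bucket size and the dict-based sparse format are assumptions for illustration:

```python
import zlib

NUM_BUCKETS = 50_000  # assumed target dimension, per the suggestion above


def hash_features(sparse_example):
    """Hash a sparse {featureID: value} map into NUM_BUCKETS buckets.

    Collisions are resolved by summing the colliding values, as in the
    standard feature-hashing trick.
    """
    hashed = {}
    for feature_id, value in sparse_example.items():
        # CRC32 as a stand-in for MurmurHash3(featureID) % NUM_BUCKETS.
        bucket = zlib.crc32(str(feature_id).encode()) % NUM_BUCKETS
        hashed[bucket] = hashed.get(bucket, 0.0) + value
    return hashed


# A 2M-dimensional sparse example collapses to at most 3 hashed buckets.
example = {1_500_000: 1.0, 7: 2.0, 999_999: 1.0}
hashed = hash_features(example)
```

The total feature mass is preserved (up to sign, since all values here are positive), which is why hashing tends to work acceptably for bag-of-words models.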
Matei
On Apr 27, 2014, at 11:24 PM, DB Tsai dbt...@stanford.edu wrote:
I'm just wondering whether the SparseVector calculations really take the
sparsity into account, or whether they just convert to dense?
On Fri, Apr 25, 2014 at 10:06 PM, John King usedforprinting...@gmail.comwrote:
I've been trying to use the Naive Bayes classifier. Each example in the
dataset has about 2 million features, only about 20-50 of which are non-zero.
Which version of MLlib are you using? For Spark 1.0, MLlib will
support sparse feature vectors, which will improve performance a lot
when computing the distance between points and centroids.
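For intuition, exploiting sparsity in a point-to-centroid squared distance can look like this plain-Python sketch (not MLlib's actual implementation): with the centroid's squared norm precomputed, only the point's non-zero entries are touched.

```python
def sparse_sq_dist(indices, values, centroid, centroid_norm_sq):
    """||x - c||^2 = ||c||^2 - 2<x, c> + ||x||^2, where x is sparse.

    indices/values hold only x's non-zero entries, so the cost is
    O(nnz) per point rather than O(dimension).
    """
    dot = sum(v * centroid[i] for i, v in zip(indices, values))
    x_norm_sq = sum(v * v for v in values)
    return centroid_norm_sq - 2.0 * dot + x_norm_sq


# Example: centroid c = [1, 0, 2], sparse x with x[0] = 3 (rest zero).
centroid = [1.0, 0.0, 2.0]
centroid_norm_sq = sum(c * c for c in centroid)  # 5.0
d2 = sparse_sq_dist([0], [3.0], centroid, centroid_norm_sq)  # (3-1)^2 + 0 + 2^2 = 8
```

With 2 million dimensions and 20-50 non-zeros, this is roughly a 40,000x reduction in work per distance computation.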
Sincerely,
DB Tsai
---
My Blog: https://www.dbtsai.com
How many labels does your dataset have? -Xiangrui
On Sat, Apr 26, 2014 at 6:03 PM, DB Tsai dbt...@stanford.edu wrote:
Which version of MLlib are you using? For Spark 1.0, MLlib will
support sparse feature vectors, which will improve performance a lot
when computing the distance between points and centroids.
I've been trying to use the Naive Bayes classifier. Each example in the
dataset has about 2 million features, only about 20-50 of which are
non-zero, so the vectors are very sparse. I keep running out of memory,
though, even for about 1,000 examples on 30 GB of RAM, while the entire
dataset is 4 million.