Re: Running out of memory Naive Bayes

2014-04-28 Thread DB Tsai
A year ago, a customer asked us to implement a Naive Bayes that could at
least train on news20, and we implemented it for them in Hadoop, using the
distributed cache to store the model.


Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Sun, Apr 27, 2014 at 11:03 PM, Xiangrui Meng men...@gmail.com wrote:

 How big is your problem and how many labels? -Xiangrui

 On Sun, Apr 27, 2014 at 10:28 PM, DB Tsai dbt...@stanford.edu wrote:
  Hi Xiangrui,
 
  We also ran into this issue at Alpine Data Labs. We ended up using an LRU
  cache to store the counts, and spilling the least-used counts to the
  distributed cache in HDFS.
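 
  For illustration, a minimal Scala sketch of an LRU count cache in this
  spirit (the class and method names are hypothetical, and the HDFS spill is
  only marked by a comment; this is not the actual implementation):
 
      import java.util.{LinkedHashMap => JLinkedHashMap}
      import java.util.Map.Entry
 
      // An access-ordered LinkedHashMap gives LRU eviction for free; the
      // eviction hook is where evicted counts would be spilled to HDFS.
      class LruCounts(capacity: Int) {
        private val map =
          new JLinkedHashMap[(Int, Int), Double](capacity, 0.75f, true) {
            override def removeEldestEntry(e: Entry[(Int, Int), Double]) = {
              if (size() > capacity) { /* spill e to HDFS here */ true }
              else false
            }
          }
        // Accumulate a count for a (label, feature) pair.
        def add(label: Int, feature: Int, delta: Double): Unit = {
          val k = (label, feature)
          val prev = if (map.containsKey(k)) map.get(k) else 0.0
          map.put(k, prev + delta)
        }
      }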
 
 
  Sincerely,
 
  DB Tsai
  ---
  My Blog: https://www.dbtsai.com
  LinkedIn: https://www.linkedin.com/in/dbtsai
 
 
  On Sun, Apr 27, 2014 at 7:34 PM, Xiangrui Meng men...@gmail.com wrote:
 
  Even if the features are sparse, the conditional probabilities are stored
  in a dense matrix. With 200 labels and 2 million features, you need to
  store at least 4e8 doubles on the driver node. With multiple
  partitions, you may need more memory on the driver. Could you try
  reducing the number of partitions and giving the driver more RAM to see
  whether that helps? -Xiangrui
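 
  For scale: 200 × 2,000,000 = 4e8 doubles, at 8 bytes each, is roughly
  3.2 GB for one dense copy of that matrix, before any per-partition
  aggregation overhead. A rough sketch of that workaround, assuming a
  training RDD named `data` (the name and partition count are made up) and
  launching through spark-submit:
 
      // Fewer partitions means fewer partial aggregates to combine on the
      // driver; --driver-memory raises the driver's heap.
      //   ./bin/spark-submit --driver-memory 16g ...
      val coalesced = data.coalesce(8)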
 
  On Sun, Apr 27, 2014 at 3:33 PM, John King 
 usedforprinting...@gmail.com
  wrote:
   I'm already using the SparseVector class.
  
   ~200 labels
  
  



Re: Running out of memory Naive Bayes

2014-04-28 Thread Matei Zaharia
Not sure if this is always ideal for Naive Bayes, but you could also hash the
features into a lower-dimensional space (e.g. reduce it to 50,000 features).
For each feature, simply take MurmurHash3(featureID) % 50000, for example.
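
A minimal Scala sketch of this hashing trick (the helper name and default
bucket count are illustrative; features that collide in a bucket simply have
their values summed):

    import scala.util.hashing.MurmurHash3
    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    // Fold ~2M feature IDs into numBuckets dimensions. With only 20-50
    // non-zeros per example, collisions are rare.
    def hashFeatures(indices: Array[Int], values: Array[Double],
                     numBuckets: Int = 50000): Vector = {
      val buckets = scala.collection.mutable.Map.empty[Int, Double]
      for ((i, v) <- indices zip values) {
        val h = MurmurHash3.stringHash(i.toString)
        val b = ((h % numBuckets) + numBuckets) % numBuckets  // non-negative
        buckets(b) = buckets.getOrElse(b, 0.0) + v
      }
      Vectors.sparse(numBuckets, buckets.toSeq.sortBy(_._1))
    }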

Matei

On Apr 27, 2014, at 11:24 PM, DB Tsai dbt...@stanford.edu wrote:

 A year ago, a customer asked us to implement a Naive Bayes that could at
 least train on news20, and we implemented it for them in Hadoop, using the
 distributed cache to store the model.
 
 
 Sincerely,
 
 DB Tsai
 ---
 My Blog: https://www.dbtsai.com
 LinkedIn: https://www.linkedin.com/in/dbtsai
 
 



Re: Running out of memory Naive Bayes

2014-04-26 Thread John King
I'm just wondering: are the SparseVector calculations really taking the
sparsity into account, or just converting to dense?


On Fri, Apr 25, 2014 at 10:06 PM, John King usedforprinting...@gmail.com wrote:

 I've been trying to use the Naive Bayes classifier. Each example in the
 dataset has about 2 million features, only about 20-50 of which are
 non-zero, so the vectors are very sparse. I keep running out of memory,
 though, even for about 1000 examples on 30 GB of RAM, while the entire
 dataset is 4 million examples. I would also like to note that I'm using
 the sparse vector class.



Re: Running out of memory Naive Bayes

2014-04-26 Thread DB Tsai
Which version of MLlib are you using? In Spark 1.0, MLlib will
support sparse feature vectors, which improves performance a lot
when computing the distance between points and centroids.
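
As a sketch of what the sparse representation looks like against the Spark
1.0 MLlib API (assuming an existing SparkContext `sc`, e.g. from spark-shell;
the indices and values are made up):

    import org.apache.spark.mllib.classification.NaiveBayes
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    // One labeled example with 2M dimensions but only three non-zeros,
    // stored as ordered (index, value) pairs instead of a dense array.
    val p = LabeledPoint(1.0,
      Vectors.sparse(2000000, Seq((3, 1.0), (40127, 2.0), (1999999, 1.0))))
    val model = NaiveBayes.train(sc.parallelize(Seq(p)), lambda = 1.0)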

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Sat, Apr 26, 2014 at 5:49 AM, John King usedforprinting...@gmail.com wrote:
 I'm just wondering: are the SparseVector calculations really taking the
 sparsity into account, or just converting to dense?






Re: Running out of memory Naive Bayes

2014-04-26 Thread Xiangrui Meng
How many labels does your dataset have? -Xiangrui

On Sat, Apr 26, 2014 at 6:03 PM, DB Tsai dbt...@stanford.edu wrote:
 Which version of MLlib are you using? In Spark 1.0, MLlib will
 support sparse feature vectors, which improves performance a lot
 when computing the distance between points and centroids.

 Sincerely,

 DB Tsai
 ---
 My Blog: https://www.dbtsai.com
 LinkedIn: https://www.linkedin.com/in/dbtsai

