Error and doubts in using MLlib Naive Bayes for text classification

2014-07-08 Thread Rahul Bhojwani
Hello,

I am a novice. I want to classify text into two classes, and for this
purpose I want to use a Naive Bayes model. I am using Python.

Here are the problems I am facing:

*Problem 1:* I want to use all words as features in a bag-of-words
model, which means my features will be counts of individual words. In that
case, whenever a new word appears in the test data (one that was never
present in the training data), I need to grow the feature vector to
incorporate that word as well. Correct me if I am wrong. Can I do that with
the present MLlib NaiveBayes, or how else can I handle this?

*Problem 2:* Since I could not proceed with all words, I did some
pre-processing and derived a few features from the text, but using these
also gives errors.
Right now I am testing with only one feature: the count of positive words
in the text. I am submitting the code below, along with the error:


#Code

import tokenizer
import gettingWordLists as gl
from pyspark.mllib.classification import NaiveBayes
from numpy import array
from pyspark import SparkContext, SparkConf

conf = (SparkConf().setMaster("local[6]").setAppName("My app")
        .set("spark.executor.memory", "1g"))

sc = SparkContext(conf=conf)

# Getting the positive dict:
pos_list = gl.getPositiveList()
tok = tokenizer.Tokenizer(preserve_case=False)

train_data = []

with open("training_file.csv", "r") as train_file:
    for line in train_file:
        tokens = line.split(",")
        msg = tokens[0]
        sentiment = tokens[1]
        count = 0
        tokens = set(tok.tokenize(msg))
        for i in tokens:
            if i.encode('utf-8') in pos_list:
                count += 1
        if 'NEG' in sentiment:
            label = 0.0
        else:
            label = 1.0
        # First element is the label, the rest are the features
        feature = [label, float(count)]
        train_data.append(feature)

model = NaiveBayes.train(sc.parallelize(array(train_data)))
print model.pi
print model.theta
print "\n\n\n\n\n", model.predict(array([5.0]))

##


*This is the output:*

[-2.24512292 -0.11195389]
[[ 0.]
 [ 0.]]

Traceback (most recent call last):
  File "naive_bayes_analyser.py", line 77, in <module>
    print "\n\n\n\n\n", model.predict(array([5.0]))
  File "F:\spark-0.9.1\spark-0.9.1\python\pyspark\mllib\classification.py", line 101, in predict
    return numpy.argmax(self.pi + dot(x, self.theta))
ValueError: matrices are not aligned

##
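(Editor's note: the shape mismatch in the traceback can be reproduced with plain NumPy, using the pi and theta values printed above. This is just an illustration of why the `dot` call fails, not MLlib code itself.)

```python
import numpy as np

pi = np.array([-2.24512292, -0.11195389])  # shape (2,), from model.pi above
theta = np.array([[0.0], [0.0]])           # shape (2, 1), from model.theta above
x = np.array([5.0])                        # shape (1,), the input to predict()

try:
    np.argmax(pi + np.dot(x, theta))
except ValueError as e:
    # np.dot needs x's length (1) to match theta's row count (2)
    print("shape mismatch:", e)
```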

*Problem 3*: As you can see, the output for model.pi is negative, i.e. the
prior probabilities are negative. Can someone explain that as well? Is it
the log of the probability?



Thanks,
-- 
Rahul K Bhojwani
3rd Year B.Tech
Computer Science and Engineering
National Institute of Technology, Karnataka


Re: Error and doubts in using MLlib Naive Bayes for text classification

2014-07-08 Thread Xiangrui Meng
1) The feature dimension should be a fixed number before you run
NaiveBayes. If you use bag of words, you need to maintain the
word-to-index dictionary yourself. You can either ignore words
that never appear in training (because they have no effect on
prediction), or use hashing to randomly project words into a
fixed-size feature space (collisions may happen).
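(Editor's note: both options can be sketched in a few lines of Python. The function names and the feature dimension of 1000 are illustrative, not MLlib APIs.)

```python
import hashlib
import numpy as np

def build_vocab(docs):
    """Option A: fixed word-to-index dictionary built from training data only."""
    vocab = {}
    for doc in docs:
        for w in doc:
            vocab.setdefault(w, len(vocab))
    return vocab

def count_vector(doc, vocab):
    """Vectorize with a fixed vocabulary; words unseen in training are ignored."""
    vec = np.zeros(len(vocab))
    for w in doc:
        if w in vocab:
            vec[vocab[w]] += 1.0
    return vec

def hashed_vector(doc, num_features=1000):
    """Option B: hashing trick -- dimension fixed up front; collisions possible."""
    vec = np.zeros(num_features)
    for w in doc:
        # Stable hash so train and test agree across processes
        idx = int(hashlib.md5(w.encode("utf-8")).hexdigest(), 16) % num_features
        vec[idx] += 1.0
    return vec
```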

3) Yes, we save the log conditional probabilities, so computing the
likelihood only requires a summation.
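(Editor's note: concretely, exponentiating the model.pi values from the run above recovers ordinary class priors that sum to one.)

```python
import numpy as np

log_pi = np.array([-2.24512292, -0.11195389])  # model.pi from the run above
priors = np.exp(log_pi)
# priors is roughly [0.106, 0.894]: valid probabilities summing to ~1.0
```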

Best,
Xiangrui

On Tue, Jul 8, 2014 at 12:01 AM, Rahul Bhojwani
rahulbhojwani2...@gmail.com wrote:
 I am really sorry. It's actually my mistake: my problem 2 is invalid,
 because training on a single feature is senseless. Sorry for the
 inconvenience. I will still be waiting for solutions to problems 1 and 3.

 Thanks,



Re: Error and doubts in using MLlib Naive Bayes for text classification

2014-07-08 Thread Rahul Bhojwani
Thanks Xiangrui. You have solved almost all my problems :)



-- 
Rahul K Bhojwani
3rd Year B.Tech
Computer Science and Engineering
National Institute of Technology, Karnataka