Thanks Xiangrui. You have solved almost all my problems :)
On Wed, Jul 9, 2014 at 1:47 AM, Xiangrui Meng men...@gmail.com wrote:
1) The feature dimension should be a fixed number before you run
NaiveBayes. If you use bag of words, you need to handle the
word-to-index dictionary yourself. You can either ignore words that
never appear in training (because they have no effect on prediction),
or use hashing to randomly project words into a fixed-size feature
space (collisions may happen).
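To illustrate the hashing approach, here is a minimal sketch (not MLlib's own implementation): each word is hashed into one of `num_features` buckets, so the feature dimension is fixed up front regardless of vocabulary. The bucket count of 1000 is an arbitrary assumption; a larger value reduces collisions at the cost of memory.

```python
def hash_features(words, num_features=1000):
    # Fixed-size count vector; unseen test words still land in a bucket,
    # so the dimension never changes. Distinct words may collide.
    vec = [0.0] * num_features
    for w in words:
        vec[hash(w) % num_features] += 1.0
    return vec

v = hash_features("the quick brown fox the".split())
```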
3) Yes, we save the log conditional probabilities, so computing the
likelihood only requires a summation.
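Concretely, with log probabilities stored, the log joint likelihood of class c is log pi_c plus a dot product of the feature counts with the stored log conditionals, and prediction is the argmax over classes. A small NumPy sketch with made-up parameters (the 2x3 probability tables below are illustrative, not from any trained model):

```python
import numpy as np

# Hypothetical stored parameters: 2 classes, 3 features.
log_pi = np.log(np.array([0.4, 0.6]))            # log class priors
log_theta = np.log(np.array([[0.5, 0.3, 0.2],    # log P(feature | class 0)
                             [0.2, 0.3, 0.5]]))  # log P(feature | class 1)

x = np.array([2.0, 0.0, 1.0])  # feature counts for one document

# Log joint likelihood per class: log pi_c + sum_j x_j * log theta_{c,j}.
# Because everything is already in log space, this is just a summation.
scores = log_pi + log_theta.dot(x)
prediction = np.argmax(scores)
```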
Best,
Xiangrui
On Tue, Jul 8, 2014 at 12:01 AM, Rahul Bhojwani
rahulbhojwani2...@gmail.com wrote:
I am really sorry; it is actually my mistake. My problem 2 is wrong,
because using a single feature makes no sense. Sorry for the
inconvenience. But I am still waiting for solutions to problems 1 and 3.
Thanks,
On Tue, Jul 8, 2014 at 12:14 PM, Rahul Bhojwani
rahulbhojwani2...@gmail.com wrote:
Hello,
I am a novice. I want to classify text into two classes, and for this
purpose I want to use a Naive Bayes model. I am using Python for it.
Here are the problems I am facing:
Problem 1: I wanted to use all words as features for the bag-of-words
model, which means my features will be counts of individual words. In
that case, whenever a new word appears in the test data (one that was
never present in the training data), I need to grow the feature vector
to incorporate that word as well. Correct me if I am wrong. Can I do
that with the present MLlib NaiveBayes, or is there another way to
handle this?
Problem 2: As I was not able to proceed with all words, I did some
pre-processing and extracted a few features from the text. But using
these also gives errors.
Right now I am testing with only one feature from the text, the count
of positive words. I am submitting the code below, along with the error:
#Code
import tokenizer
import gettingWordLists as gl
from pyspark.mllib.classification import NaiveBayes
from numpy import array
from pyspark import SparkContext, SparkConf

conf = (SparkConf().setMaster("local[6]").setAppName("My app")
        .set("spark.executor.memory", "1g"))
sc = SparkContext(conf=conf)

# Getting the positive dict:
pos_list = []
pos_list = gl.getPositiveList()
tok = tokenizer.Tokenizer(preserve_case=False)

train_data = []
with open("training_file.csv", "r") as train_file:
    for line in train_file:
        tokens = line.split(",")
        msg = tokens[0]
        sentiment = tokens[1]
        count = 0
        tokens = set(tok.tokenize(msg))
        for i in tokens:
            if i.encode('utf-8') in pos_list:
                count += 1
        if 'NEG' in sentiment:
            label = 0.0
        else:
            label = 1.0
        feature = []
        feature.append(label)
        feature.append(float(count))
        train_data.append(feature)

model = NaiveBayes.train(sc.parallelize(array(train_data)))
print model.pi
print model.theta
print "\n\n\n\n\n", model.predict(array([5.0]))
##
This is the output:
[-2.24512292 -0.11195389]
[[ 0.]
 [ 0.]]
Traceback (most recent call last):
  File "naive_bayes_analyser.py", line 77, in <module>
    print "\n\n\n\n\n", model.predict(array([5.0]))
  File "F:\spark-0.9.1\spark-0.9.1\python\pyspark\mllib\classification.py", line 101, in predict
    return numpy.argmax(self.pi + dot(x, self.theta))
ValueError: matrices are not aligned
##
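The "matrices are not aligned" error indicates a shape mismatch between the input vector and the stored theta (printed above as a 2x1 column). Assuming pi has one entry per class and theta has one row per class and one column per feature (an assumption based on the printed shapes, not on the pyspark source), the class scores can be reproduced manually with matching shapes:

```python
import numpy as np

# Values copied from the printed model output above; shapes are assumed.
pi = np.array([-2.24512292, -0.11195389])  # log class priors, one per class
theta = np.array([[0.0],                   # log conditionals, class 0
                  [0.0]])                  # log conditionals, class 1
x = np.array([5.0])                        # single-feature input

# Score each class as log prior + theta_row . x, then pick the argmax.
# theta.dot(x) has shape (2,), so the addition with pi is well aligned.
scores = pi + theta.dot(x)
label = np.argmax(scores)
```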
Problem 3: As you can see, the output for model.pi is negative, i.e.
the prior probabilities are negative. Can someone explain that as well?
Is it the log of the probability?
Thanks,
--
Rahul K Bhojwani
3rd Year B.Tech
Computer Science and Engineering
National Institute of Technology, Karnataka