Hello,
I am a novice. I want to classify text into two classes, and for this
purpose I want to use a Naive Bayes model. I am using Python for it.
Here are the problems I am facing:
*Problem 1:* I want to use all words as features in a bag-of-words
model, which means my features will be the counts of individual words. In this
case, whenever a new word appears in the test data (one that was never present in
the training data), I would need to increase the size of the feature vector to
include that word as well. Correct me if I am wrong. Can I do that with
the present MLlib NaiveBayes? If not, how can I handle
this?
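To make the question concrete, here is a minimal sketch in plain Python (not MLlib) of what I mean. The helper names build_vocabulary and vectorize and the toy sentences are only made up for illustration. With the vocabulary fixed from the training data, a word seen only at test time has no column, so it is either dropped as below or the vector has to grow.

```python
def build_vocabulary(documents):
    """Map each distinct training word to a column index."""
    vocab = {}
    for doc in documents:
        for word in doc.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def vectorize(doc, vocab):
    """Count the words of doc per vocabulary column; collect unseen words."""
    counts = [0.0] * len(vocab)
    unseen = []
    for word in doc.lower().split():
        if word in vocab:
            counts[vocab[word]] += 1.0
        else:
            unseen.append(word)  # no column exists for this word
    return counts, unseen


# Toy data, purely illustrative.
train_docs = ["good movie", "bad movie"]
vocab = build_vocabulary(train_docs)
test_counts, test_unseen = vectorize("good good plot", vocab)
print(test_counts, test_unseen)
```

Here "plot" never occurred in training, so it ends up in the unseen list instead of the feature vector, which keeps the vector length equal to the training vocabulary size.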
*Problem 2:* As I was not able to proceed with all words, I did some
pre-processing and extracted a few features from the text. But using these
also gives errors.
Right now I am testing with only one feature from the text, namely the count of
positive words. The code is below, along with the error:
#############Code
import tokenizer
import gettingWordLists as gl
from pyspark.mllib.classification import NaiveBayes
from numpy import array
from pyspark import SparkContext, SparkConf

conf = (SparkConf().setMaster("local[6]").setAppName("My app")
        .set("spark.executor.memory", "1g"))
sc = SparkContext(conf=conf)

# Getting the positive dict:
pos_list = gl.getPositiveList()
tok = tokenizer.Tokenizer(preserve_case=False)

train_data = []
with open("training_file.csv", "r") as train_file:
    for line in train_file:
        tokens = line.split(",")
        msg = tokens[0]
        sentiment = tokens[1]
        # Count the distinct tokens of the message that are in the positive list.
        count = 0
        tokens = set(tok.tokenize(msg))
        for i in tokens:
            if i.encode('utf-8') in pos_list:
                count += 1
        if 'NEG' in sentiment:
            label = 0.0
        else:
            label = 1.0
        # Each row is [label, feature].
        feature = []
        feature.append(label)
        feature.append(float(count))
        train_data.append(feature)

model = NaiveBayes.train(sc.parallelize(array(train_data)))
print model.pi
print model.theta
print "\n\n\n\n\n" , model.predict(array([5.0]))
##############
*This is the output:*
[-2.24512292 -0.11195389]
[[ 0.]
 [ 0.]]
Traceback (most recent call last):
  File "naive_bayes_analyser.py", line 77, in <module>
    print "\n\n\n\n\n" , model.predict(array([5.0]))
  File "F:\spark-0.9.1\spark-0.9.1\python\pyspark\mllib\classification.py", line 101, in predict
    return numpy.argmax(self.pi + dot(x, self.theta))
ValueError: matrices are not aligned
##############
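The shapes in the output look like the issue to me: model.theta prints as a 2x1 array, while the vector passed to predict has one element. Here is a minimal NumPy-only sketch, assuming exactly those shapes (the arrays are copied from the output above, not produced by Spark), that reproduces the same kind of error. Whether transposing theta is what predict is meant to do is only my guess.

```python
import numpy as np

pi = np.array([-2.24512292, -0.11195389])  # shape (2,), as printed by model.pi
theta = np.array([[0.0], [0.0]])           # shape (2, 1), as printed by model.theta
x = np.array([5.0])                        # shape (1,), the vector passed to predict

try:
    np.dot(x, theta)                       # (1,) with (2, 1): inner dimensions differ
    aligned = True
except ValueError as err:
    aligned = False
    print("dot(x, theta) failed:", err)

# Guess: with theta transposed to (1, 2), the product gives one score per class.
scores = pi + np.dot(x, theta.T)
print(scores.shape, int(np.argmax(scores)))
```

With theta all zeros, the scores are just the values of pi, so the larger prior decides the predicted class.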
*Problem 3*: As you can see, the output for model.pi is negative. That is, the prior
probabilities are negative. Can someone explain that as well? Is it the log of
the probability?
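One check that seems to support the log interpretation, using only the two values printed above: if they are natural logs of the class priors, exponentiating them should give numbers between 0 and 1 that sum to about 1. A small sketch (the variable names are mine):

```python
import math

log_priors = [-2.24512292, -0.11195389]  # values printed by model.pi

# If these are natural logs of probabilities, exp() recovers the priors.
priors = [math.exp(v) for v in log_priors]
total = sum(priors)
print(priors, total)
```

If the sum comes out close to 1, the values behave like log priors for the two classes.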
Thanks,
--
Rahul K Bhojwani
3rd Year B.Tech
Computer Science and Engineering
National Institute of Technology, Karnataka