Error and doubts in using MLlib Naive Bayes for text classification

Rahul Bhojwani Mon, 07 Jul 2014 23:46:25 -0700

Hello,

I am a novice. I want to classify text into two classes, and for this
purpose I want to use a Naive Bayes model. I am using Python.


Here are the problems I am facing:

*Problem 1:* I want to use all words as features in a bag-of-words
model, which means my features will be the counts of individual words. In
that case, whenever a new word appears in the test data (one that was never
present in the training data), I would need to increase the size of the
feature vector to incorporate that word as well. Correct me if I am wrong.
Can I do that with the present MLlib NaiveBayes? If not, how can I handle
this?
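
A minimal sketch of the bag-of-words setup described above, in plain Python. It assumes one common convention: the vocabulary is fixed from the training data, and words that only appear in test data are skipped, so the vector length never changes. The function names and the toy sentences are hypothetical and are not part of MLlib.

```python
from collections import Counter

def build_vocabulary(train_docs):
    """Assign a fixed column index to every word seen in the training data."""
    vocab = {}
    for doc in train_docs:
        for word in doc.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def to_count_vector(doc, vocab):
    """Count known words; words missing from the vocabulary are skipped."""
    vec = [0.0] * len(vocab)
    for word, n in Counter(doc.lower().split()).items():
        idx = vocab.get(word)
        if idx is not None:
            vec[idx] = float(n)
    return vec

train_docs = ["good good movie", "bad movie"]  # made-up toy data
vocab = build_vocabulary(train_docs)
test_vec = to_count_vector("good unseen movie", vocab)
```

Here test_vec has the same length as the vocabulary even though "unseen" never occurred in training.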

*Problem 2:* As I was not able to proceed with all words, I did some
pre-processing and extracted a few features from the text. But using these
is also giving errors.
Right now I am testing with only one feature from the text: the count of
positive words. The code is below, followed by the error:


#############Code

import tokenizer
import gettingWordLists as gl
from pyspark.mllib.classification import NaiveBayes
from numpy import array
from pyspark import SparkContext, SparkConf

conf = (SparkConf().setMaster("local[6]").setAppName("My app")
        .set("spark.executor.memory", "1g"))

sc=SparkContext(conf = conf)

# Getting the positive dict:
pos_list = []
pos_list = gl.getPositiveList()
tok = tokenizer.Tokenizer(preserve_case=False)


train_data  = []

with open("training_file.csv","r") as train_file:
    for line in train_file:
        tokens = line.split(",")
        msg = tokens[0]
        sentiment = tokens[1]
        count = 0
        tokens = set(tok.tokenize(msg))
        for i in tokens:
            if i.encode('utf-8') in pos_list:
                count+=1
        if sentiment.__contains__('NEG'):
            label = 0.0
        else:
            label = 1.0
        feature = []
        feature.append(label)
        feature.append(float(count))
        train_data.append(feature)


model = NaiveBayes.train(sc.parallelize(array(train_data)))
print model.pi
print model.theta
print "\n\n\n\n\n" , model.predict(array([5.0]))

##############


*This is the output:*

[-2.24512292 -0.11195389]
[[ 0.]
 [ 0.]]
Traceback (most recent call last):
  File "naive_bayes_analyser.py", line 77, in <module>
    print "\n\n\n\n\n" , model.predict(array([5.0]))
  File "F:\spark-0.9.1\spark-0.9.1\python\pyspark\mllib\classification.py", line 101, in predict
    return numpy.argmax(self.pi + dot(x, self.theta))
ValueError: matrices are not aligned

##############
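
The mismatch can be reproduced with NumPy alone, assuming the shapes implied by the printed output (a length-2 model.pi and a 2x1 model.theta). This is only a sketch of the shape arithmetic, not a statement about how the library is meant to be called; the transpose at the end is an assumption used to show which orientation would align.

```python
import numpy as np

# Shapes copied from the printed output; x is the vector passed to predict().
pi = np.array([-2.24512292, -0.11195389])  # shape (2,): one entry per class
theta = np.array([[0.0], [0.0]])           # shape (2, 1): classes x features
x = np.array([5.0])                        # shape (1,): one feature

try:
    np.dot(x, theta)   # (1,) against (2, 1): inner sizes 1 and 2 differ
    aligned = True
except ValueError:
    aligned = False

# With theta transposed to (1, 2) the product is defined: one score per class.
scores = pi + np.dot(x, theta.T)
predicted = int(np.argmax(scores))
```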

*Problem 3:* As you can see, the output for model.pi is negative; that is,
the prior probabilities are negative. Can someone explain that as well? Is
it the log of the probability?
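
One quick check of the log-probability guess, using only the values printed above: if model.pi holds natural logs of the class priors, exponentiating should give values between 0 and 1 that sum to about 1. The variable names are illustrative.

```python
import numpy as np

log_priors = np.array([-2.24512292, -0.11195389])  # model.pi as printed above
priors = np.exp(log_priors)                        # undo the assumed natural log
total = float(priors.sum())
```

Since the log of any probability below 1 is negative, negative entries would be expected under that reading.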



Thanks,
-- 
Rahul K Bhojwani
3rd Year B.Tech
Computer Science and Engineering
National Institute of Technology, Karnataka
