Hello, I am a novice. I want to classify text into two classes, and for this I want to use a Naive Bayes model. I am using Python.
Here are the problems I am facing:

*Problem 1:* I wanted to use all words as features for a bag-of-words model, which means my features would be the counts of individual words. In that case, whenever a new word appears in the test data (one that was never present in the training data), I would need to increase the size of the feature vector to incorporate that word as well. Correct me if I am wrong. Can I do that with the present MLlib NaiveBayes? If not, how can I handle this?

*Problem 2:* As I was not able to proceed with all words, I did some pre-processing and extracted a few features from the text. But this is also giving errors. Right now I am testing with only one feature from the text, the count of positive words. I am submitting the code below, along with the error:

#############Code
import tokenizer
import gettingWordLists as gl
from pyspark.mllib.classification import NaiveBayes
from numpy import array
from pyspark import SparkContext, SparkConf

conf = (SparkConf().setMaster("local[6]").setAppName("My app").set("spark.executor.memory", "1g"))
sc=SparkContext(conf = conf)

# Getting the positive dict:
pos_list = []
pos_list = gl.getPositiveList()
tok = tokenizer.Tokenizer(preserve_case=False)

train_data = []
with open("training_file.csv","r") as train_file:
    for line in train_file:
        tokens = line.split(",")
        msg = tokens[0]
        sentiment = tokens[1]
        count = 0
        tokens = set(tok.tokenize(msg))
        for i in tokens:
            if i.encode('utf-8') in pos_list:
                count+=1
        if sentiment.__contains__('NEG'):
            label = 0.0
        else:
            label = 1.0
        feature = []
        feature.append(label)
        feature.append(float(count))
        train_data.append(feature)

model = NaiveBayes.train(sc.parallelize(array(train_data)))
print model.pi
print model.theta
print "\n\n\n\n\n" , model.predict(array([5.0]))
##############

*This is the output:*

*[-2.24512292 -0.11195389]
[[ 0.]
 [ 0.]]
Traceback (most recent call last):
  File "naive_bayes_analyser.py", line 77, in <module>
    print "\n\n\n\n\n" , model.predict(array([5.0]))
  File "F:\spark-0.9.1\spark-0.9.1\python\pyspark\mllib\classification.py", line 101, in predict
    return numpy.argmax(self.pi + dot(x, self.theta))
ValueError: matrices are not aligned*
##############

*Problem 3:* As you can see, the values in model.pi are negative, that is, the prior probabilities are negative. Can someone explain that as well? Is it the log of the probability?

Thanks,
--
Rahul K Bhojwani
3rd Year B.Tech
Computer Science and Engineering
National Institute of Technology, Karnataka
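
P.S. To make Problems 1 and 3 concrete, here is a small sketch in plain Python, independent of MLlib. It is only my working assumption, not something I have checked against the MLlib source. For Problem 1 it hashes each word into a vector of fixed length, so that unseen test words never change the vector size (NUM_FEATURES and the helper names hashed_counts and log_priors are made up for illustration). For Problem 3 it computes unsmoothed log priors, which are always negative, and checks whether the two model.pi values printed above exponentiate to probabilities that sum to one.

```python
import math
import zlib

NUM_FEATURES = 1000  # hypothetical fixed vector length, chosen for illustration


def hashed_counts(tokens, num_features=NUM_FEATURES):
    """Map tokens to a fixed-length count vector (the "hashing trick")."""
    vec = [0.0] * num_features
    for tok in tokens:
        # crc32 is deterministic across runs, unlike the built-in hash() for str
        idx = zlib.crc32(tok.encode("utf-8")) % num_features
        vec[idx] += 1.0
    return vec


def log_priors(labels):
    """Unsmoothed log class priors: log(count(c) / N) for each class c."""
    n = float(len(labels))
    return {c: math.log(labels.count(c) / n) for c in sorted(set(labels))}


# Words seen in training and words never seen before give vectors of equal length.
train_vec = hashed_counts(["good", "movie", "good"])
test_vec = hashed_counts(["never", "seen", "word"])
same_length = len(train_vec) == len(test_vec)

# Log of a probability strictly between 0 and 1 is negative.
priors = log_priors([0.0, 1.0, 1.0, 1.0])

# The two model.pi values from my output, turned back into probabilities.
pi_sum = math.exp(-2.24512292) + math.exp(-0.11195389)

print(same_length)
print(priors)
print(pi_sum)
```

If pi_sum comes out very close to 1, that would support reading model.pi as log priors. The cost of the hashing approach is that two different words can collide in the same slot, so the vector length has to be chosen large enough for the vocabulary.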