Error and doubts in using MLlib Naive Bayes for text classification
Hello,

I am a novice. I want to classify text into two classes, and I want to use the Naive Bayes model for this. I am using Python. Here are the problems I am facing:

*Problem 1:* I wanted to use all words as features for the bag-of-words model, which means my features will be the counts of individual words. In this case, whenever a new word appears in the test data (one that was never present in the training data), I need to increase the size of the feature vector to incorporate that word as well. Correct me if I am wrong. Can I do that in the present MLlib NaiveBayes? If not, how can I incorporate this?

*Problem 2:* As I was not able to proceed with all words, I did some pre-processing and extracted a few features from the text. But using these also gives errors. Right now I am testing with only one feature from the text, the count of positive words. I am submitting the code below, along with the error:

    #Code
    import tokenizer
    import gettingWordLists as gl
    from pyspark.mllib.classification import NaiveBayes
    from numpy import array
    from pyspark import SparkContext, SparkConf

    conf = (SparkConf().setMaster("local[6]").setAppName("My app").set("spark.executor.memory", "1g"))
    sc=SparkContext(conf = conf)

    # Getting the positive dict:
    pos_list = []
    pos_list = gl.getPositiveList()
    tok = tokenizer.Tokenizer(preserve_case=False)

    train_data = []
    with open("training_file.csv","r") as train_file:
        for line in train_file:
            tokens = line.split(",")
            msg = tokens[0]
            sentiment = tokens[1]
            count = 0
            tokens = set(tok.tokenize(msg))
            for i in tokens:
                if i.encode('utf-8') in pos_list:
                    count+=1
            if sentiment.__contains__('NEG'):
                label = 0.0
            else:
                label = 1.0
            # Each row is [label, feature]
            feature = []
            feature.append(label)
            feature.append(float(count))
            train_data.append(feature)

    model = NaiveBayes.train(sc.parallelize(array(train_data)))
    print model.pi
    print model.theta
    print "\n\n\n\n\n" , model.predict(array([5.0]))

*This is the output:*

    [-2.24512292 -0.11195389]
    [[ 0.]
     [ 0.]]
    Traceback (most recent call last):
      File "naive_bayes_analyser.py", line 77, in <module>
        print "\n\n\n\n\n" , model.predict(array([5.0]))
      File "F:\spark-0.9.1\spark-0.9.1\python\pyspark\mllib\classification.py", line 101, in predict
        return numpy.argmax(self.pi + dot(x, self.theta))
    ValueError: matrices are not aligned

*Problem 3:* As you can see, the output for model.pi is negative; that is, the prior probabilities are negative. Can someone explain that as well? Is it the log of the probability?

Thanks,
--
Rahul K Bhojwani
3rd Year B.Tech
Computer Science and Engineering
National Institute of Technology, Karnataka
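On Problem 3, the printed model.pi values can be checked directly. The sketch below is only an illustration in plain Python, not the MLlib code: it exponentiates the two values from the output above and confirms that they behave like log priors (they come out to roughly 0.106 and 0.894 and sum to approximately 1). The `predict` helper is hypothetical; it assumes that theta is laid out as one row per class, as the printed 2x1 array suggests, and that each class score is the log prior plus the sum of feature counts times log conditional probabilities.

```python
import math

# Values copied from the output printed in the post above.
pi = [-2.24512292, -0.11195389]   # one entry per class
theta = [[0.0], [0.0]]            # assumed layout: one row per class, one column per feature

# If pi holds log priors, exponentiating recovers probabilities that sum to ~1.
priors = [math.exp(value) for value in pi]
prior_total = sum(priors)


def predict(x, pi, theta):
    """Hypothetical scorer: log prior plus sum of count * log conditional, per class."""
    scores = []
    for log_prior, log_conditionals in zip(pi, theta):
        log_likelihood = sum(w * count for w, count in zip(log_conditionals, x))
        scores.append(log_prior + log_likelihood)
    # Return the index of the class with the highest score.
    return scores.index(max(scores))


label = predict([5.0], pi, theta)
print(priors, prior_total, label)
```

Because every entry of theta is 0 here, the single feature contributes nothing to either score, so the prediction is decided by the priors alone.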
Re: Error and doubts in using MLlib Naive Bayes for text classification
1) The feature dimension must be fixed before you run NaiveBayes. If you use bag of words, you need to maintain the word-to-index dictionary yourself. You can either ignore words that never appear in training (they have no effect on prediction), or use hashing to randomly project words into a fixed-size feature space (collisions may happen).

3) Yes, we store the log conditional probabilities, so computing the likelihood only requires summation.

Best,
Xiangrui

On Tue, Jul 8, 2014 at 12:01 AM, Rahul Bhojwani rahulbhojwani2...@gmail.com wrote:

I am really sorry. It's actually my mistake. My problem 2 is wrong because using a single feature is a senseless thing. Sorry for the inconvenience. But I will still be waiting for the solutions to problems 1 and 3.

Thanks,
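The two options in answer 1) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not MLlib API: the helper names (`build_vocabulary`, `vocabulary_features`, `hashed_features`), the toy documents, and the feature-space size `NUM_FEATURES` are all hypothetical. `zlib.crc32` is used as the hash only because it is stable across runs; any stable hash would do.

```python
import zlib

NUM_FEATURES = 1000  # assumed size of the fixed hashed feature space


def build_vocabulary(training_docs):
    """Build a word-to-index dictionary from tokenized training documents."""
    vocabulary = {}
    for tokens in training_docs:
        for word in tokens:
            if word not in vocabulary:
                vocabulary[word] = len(vocabulary)
    return vocabulary


def vocabulary_features(tokens, vocabulary):
    """Option 1: count words against the fixed dictionary; unseen words are ignored."""
    vector = [0.0] * len(vocabulary)
    for word in tokens:
        index = vocabulary.get(word)
        if index is not None:
            vector[index] += 1.0
    return vector


def hashed_features(tokens, num_features=NUM_FEATURES):
    """Option 2: hash each word into a fixed number of buckets (collisions may happen)."""
    vector = [0.0] * num_features
    for word in tokens:
        index = zlib.crc32(word.encode("utf-8")) % num_features
        vector[index] += 1.0
    return vector


# Toy data: "unseen" never appears in training.
train_docs = [["good", "movie"], ["bad", "movie"]]
vocabulary = build_vocabulary(train_docs)
test_tokens = ["good", "good", "unseen"]

vocab_vector = vocabulary_features(test_tokens, vocabulary)
hashed_vector = hashed_features(test_tokens)
print(vocab_vector, len(hashed_vector))
```

In both cases the vector length is the same for training and test documents, which is what NaiveBayes needs. The dictionary approach drops the unseen word, while the hashing approach still counts it, at the cost of possible collisions between different words.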
Re: Error and doubts in using MLlib Naive Bayes for text classification
Thanks Xiangrui. You have solved almost all my problems :)