Re: Error and doubts in using MLlib Naive Bayes for text classification

Xiangrui Meng Tue, 08 Jul 2014 13:18:13 -0700

1) The feature dimension must be fixed before you run NaiveBayes. If
you use bag of words, you need to maintain the word-to-index
dictionary yourself. You can either ignore words that never appear in
training (they have no effect on prediction), or use hashing to map
words into a fixed-size feature space (collisions may happen).
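
Both options can be sketched in plain Python. This is only an
illustration under my own assumptions, not MLlib code: the function
names, the whitespace tokenization, and num_features=16 are all
hypothetical. zlib.crc32 is used instead of the built-in hash() because
hash() on strings can differ between Python processes, which would give
inconsistent indices across workers.

```python
import zlib


def build_vocabulary(train_docs):
    """Assign each distinct training word a fixed column index."""
    vocab = {}
    for doc in train_docs:
        for word in doc.split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def vectorize_with_vocabulary(doc, vocab):
    """Count words; words absent from the training vocabulary are ignored."""
    features = [0.0] * len(vocab)
    for word in doc.split():
        index = vocab.get(word)
        if index is not None:
            features[index] += 1.0
    return features


def vectorize_with_hashing(doc, num_features=16):
    """Count words in a fixed-size vector; distinct words may collide."""
    features = [0.0] * num_features
    for word in doc.split():
        index = zlib.crc32(word.encode("utf-8")) % num_features
        features[index] += 1.0
    return features


# Hypothetical toy data, for illustration only.
train_docs = ["good movie", "bad movie"]
vocab = build_vocabulary(train_docs)

# "great" never appeared in training, so the dictionary approach drops it,
# while the hashing approach still counts it in some bucket.
vec = vectorize_with_vocabulary("good great movie", vocab)
hashed = vectorize_with_hashing("good great movie")
```

In both cases the vector length is the same for training and test
documents, which is what NaiveBayes needs.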


3) Yes, we store the log conditional probabilities (hence the negative
values), so computing the likelihood only requires a summation.
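
As a rough sketch of why only a summation is needed: with log priors
and log conditional probabilities, the score of each class is the log
prior plus the count-weighted sum of the log conditionals, and the
prediction is the class with the largest score. The numbers below are
made up for illustration, and the layout (one row of log_theta per
class) is an assumption of this sketch, not a statement about MLlib
internals.

```python
import math

# Hypothetical model with 2 classes and 2 features (values are invented).
log_prior = [math.log(0.25), math.log(0.75)]
log_theta = [
    [math.log(0.5), math.log(0.5)],  # class 0: log P(feature | class 0)
    [math.log(0.2), math.log(0.8)],  # class 1: log P(feature | class 1)
]


def class_scores(x):
    """Return the unnormalized log posterior of each class for counts x."""
    scores = []
    for prior, row in zip(log_prior, log_theta):
        scores.append(prior + sum(count * lp for count, lp in zip(x, row)))
    return scores


def predict(x):
    """Return the index of the class with the highest log score."""
    scores = class_scores(x)
    return scores.index(max(scores))


x = [3.0, 1.0]  # feature counts for one document
scores = class_scores(x)
label = predict(x)
```

Because every probability is at most 1, every stored log value is at
most 0, which is why the printed priors look negative.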

Best,
Xiangrui

On Tue, Jul 8, 2014 at 12:01 AM, Rahul Bhojwani
<rahulbhojwani2...@gmail.com> wrote:
> I am really sorry. It's actually my mistake. My problem 2 is wrong because
> using a single feature is a senseless thing. Sorry for the inconvenience.
> But I will still be waiting for solutions to problems 1 and 3.
>
> Thanks,
>
>
> On Tue, Jul 8, 2014 at 12:14 PM, Rahul Bhojwani
> <rahulbhojwani2...@gmail.com> wrote:
>>
>> Hello,
>>
>> I am a novice. I want to classify text into two classes. For this
>> purpose I want to use a Naive Bayes model. I am using Python for it.
>>
>> Here are the problems I am facing:
>>
>> Problem 1: I wanted to use all words as features for the bag of words
>> model, which means my features will be counts of individual words. In this
>> case, whenever a new word appears in the test data (one that was never
>> present in the training data), I need to increase the size of the feature
>> vector to incorporate that word as well. Correct me if I am wrong. Can I do
>> that in the present MLlib NaiveBayes? Or how can I incorporate this?
>>
>> Problem 2: As I was not able to proceed with all words, I did some
>> pre-processing and extracted a few features from the text. But using these
>> also gives errors.
>> Right now I am testing with only one feature from the text: the count
>> of positive words. I am submitting the code below, along with the error:
>>
>>
>> #############Code
>>
>> import tokenizer
>> import gettingWordLists as gl
>> from pyspark.mllib.classification import NaiveBayes
>> from numpy import array
>> from pyspark import SparkContext, SparkConf
>>
>> conf = (SparkConf().setMaster("local[6]").setAppName("My app")
>>         .set("spark.executor.memory", "1g"))
>>
>> sc=SparkContext(conf = conf)
>>
>> # Getting the positive dict:
>> pos_list = []
>> pos_list = gl.getPositiveList()
>> tok = tokenizer.Tokenizer(preserve_case=False)
>>
>>
>> train_data  = []
>>
>> with open("training_file.csv","r") as train_file:
>>     for line in train_file:
>>         tokens = line.split(",")
>>         msg = tokens[0]
>>         sentiment = tokens[1]
>>         count = 0
>>         tokens = set(tok.tokenize(msg))
>>         for i in tokens:
>>             if i.encode('utf-8') in pos_list:
>>                 count+=1
>>         if sentiment.__contains__('NEG'):
>>             label = 0.0
>>         else:
>>             label = 1.0
>>         feature = []
>>         feature.append(label)
>>         feature.append(float(count))
>>         train_data.append(feature)
>>
>>
>> model = NaiveBayes.train(sc.parallelize(array(train_data)))
>> print model.pi
>> print model.theta
>> print "\n\n\n\n\n" , model.predict(array([5.0]))
>>
>> ##############
>> This is the output:
>>
>> [-2.24512292 -0.11195389]
>> [[ 0.]
>>  [ 0.]]
>>
>>
>>
>>
>>
>> Traceback (most recent call last):
>>   File "naive_bayes_analyser.py", line 77, in <module>
>>     print "\n\n\n\n\n" , model.predict(array([5.0]))
>>   File "F:\spark-0.9.1\spark-0.9.1\python\pyspark\mllib\classification.py", line 101, in predict
>>     return numpy.argmax(self.pi + dot(x, self.theta))
>> ValueError: matrices are not aligned
>>
>> ##############
>>
>> Problem 3: As you can see, the output for model.pi is negative. That is,
>> the prior probabilities are negative. Can someone explain that as well? Is
>> it the log of the probability?
>>
>>
>>
>> Thanks,
>> --
>> Rahul K Bhojwani
>> 3rd Year B.Tech
>> Computer Science and Engineering
>> National Institute of Technology, Karnataka
>
>
>
>
> --
> Rahul K Bhojwani
> 3rd Year B.Tech
> Computer Science and Engineering
> National Institute of Technology, Karnataka
