[ 
https://issues.apache.org/jira/browse/SPARK-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rahul K Bhojwani updated SPARK-2433:
------------------------------------

    Summary: The MLlib implementation for Naive Bayes in Spark 0.9.1 is having 
an implementation bug.  (was: The MLlib implementation for Naive Bayes in Spark 
0.9.1 is having a implementation bug.)

> The MLlib implementation for Naive Bayes in Spark 0.9.1 is having an 
> implementation bug.
> ----------------------------------------------------------------------------------------
>
>                 Key: SPARK-2433
>                 URL: https://issues.apache.org/jira/browse/SPARK-2433
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib, PySpark
>    Affects Versions: 0.9.1
>         Environment: Any 
>            Reporter: Rahul K Bhojwani
>              Labels: easyfix, test
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> I don't have much experience reporting bugs; this is my first time. If 
> something is not clear, please feel free to contact me (details given below).
> The bug is in the PySpark MLlib library.
> Path : \spark-0.9.1\python\pyspark\mllib\classification.py
> Class: NaiveBayesModel
> Method:  self.predict
> Earlier Implementation:
> def predict(self, x):
>     """Return the most likely class for a data vector x"""
>     return numpy.argmax(self.pi + dot(x, self.theta))
>         
> New Implementation:
> No:1
> def predict(self, x):
>     """Return the most likely class for a data vector x"""
>     return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta),x)))
> No:2
> def predict(self, x):
>     """Return the most likely class for a data vector x"""
>     return numpy.argmax(self.pi + dot(x,self.theta.T))
> Explanation:
> I believe No:1 is correct. I am not sure about No:2.
> Error 1:
> The matrix self.theta has dimension [n_classes, n_features], 
> while the vector x has dimension [1, n_features].
> Taking the dot product of x with self.theta does not work, because the shapes
> are [1, n_features] x [n_classes, n_features].
> It raises "ValueError: matrices are not aligned" whenever n_classes differs
> from n_features.
> In the commented example given in classification.py, n_classes = 
> n_features = 2, which is why no error appears there.
> Both implementation No:1 and implementation No:2 take care of this.
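
The shape argument above can be checked with a minimal NumPy sketch. The
sizes (3 classes, 4 features), the probability values, and all variable names
are assumptions made for illustration; none of them come from
classification.py.

```python
import numpy as np

# Hypothetical model with n_classes = 3 and n_features = 4 (sizes chosen only
# so that n_classes != n_features; they are not taken from the issue).
p_feature = np.array([[0.4, 0.3, 0.2, 0.1],
                      [0.1, 0.2, 0.3, 0.4],
                      [0.25, 0.25, 0.25, 0.25]])
theta = np.log(p_feature)           # shape (n_classes, n_features)
pi = np.log(np.full(3, 1.0 / 3.0))  # shape (n_classes,)
x = np.array([1.0, 0.0, 2.0, 1.0])  # shape (n_features,)

# x . theta needs the inner dimensions to agree: 4 vs. 3 here, so it fails.
try:
    np.dot(x, theta)
    unaligned_raises = False
except ValueError:
    unaligned_raises = True

# Both proposed expressions yield one score per class.
scores_1 = pi + np.log(np.dot(np.exp(theta), x))  # proposal No:1
scores_2 = pi + np.dot(x, theta.T)                # proposal No:2

# With a square theta (n_classes == n_features == 2) the unaligned product
# runs without error, but it sums over the class axis instead of the features.
theta_sq = np.log(np.array([[0.9, 0.1], [0.3, 0.7]]))
x_sq = np.array([1.0, 2.0])
square_ok = np.dot(x_sq, theta_sq).shape == (2,)
square_differs = not np.allclose(np.dot(x_sq, theta_sq),
                                 np.dot(x_sq, theta_sq.T))
```

The square case suggests why the 2x2 docstring example would not expose the
problem: the product is defined there, even though it contracts over the
wrong axis.
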
> Error 2:
> The basic form of naive Bayes is: P(class_n | sample) = 
> count_feature_1 * P(feature_1 | class_n) * ... * count_feature_n * 
> P(feature_n | class_n) * P(class_n) / P(sample), where P(sample) is a constant,
> and the predicted class is the one with the maximum value.
> That is what implementation No:1 does.
> Implementation No:2 instead picks the class with the maximum value of:
> exp(count_feature_1) * P(feature_1 | class_n) * ... * exp(count_feature_n) * 
> P(feature_n | class_n) * P(class_n)
> I don't know whether it gives the correct result.
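
One way to see what each expression evaluates to is to compare it with a
hand-written formula on a toy model. This is only a sketch: it assumes that
self.theta holds log P(feature | class) and self.pi holds log P(class), and
the two-class, three-feature numbers below are invented for illustration.

```python
import numpy as np

# Hypothetical toy model: 2 classes, 3 features (values are made up).
p_feature = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.3, 0.6]])  # P(feature_j | class_c), rows sum to 1
p_class = np.array([0.5, 0.5])           # P(class_c)
theta = np.log(p_feature)                # assumed meaning of self.theta
pi = np.log(p_class)                     # assumed meaning of self.pi
x = np.array([3.0, 0.0, 1.0])            # feature counts for one sample

score_1 = pi + np.log(np.dot(np.exp(theta), x))  # proposal No:1
score_2 = pi + np.dot(x, theta.T)                # proposal No:2

# No:1 is the log of P(class) times a SUM over features of count_j * P(f_j | c).
by_hand_1 = np.log(p_class * (p_feature * x).sum(axis=1))

# No:2 is the log of P(class) times a PRODUCT over features of
# P(f_j | c) ** count_j, i.e. the counts act as exponents.
by_hand_2 = np.log(p_class * np.prod(p_feature ** x, axis=1))

matches_1 = np.allclose(score_1, by_hand_1)
matches_2 = np.allclose(score_2, by_hand_2)
```

Under these assumptions, No:2 agrees with the usual multinomial naive Bayes
score, in which each count appears as an exponent on P(feature | class),
while No:1 combines the features through a sum rather than a product.
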
> Thanks
> Rahul Bhojwani
> rahulbhojwani2...@gmail.com



--
This message was sent by Atlassian JIRA
(v6.2#6252)
