[ https://issues.apache.org/jira/browse/SPARK-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rahul K Bhojwani updated SPARK-2433:
------------------------------------

    Summary: The MLlib implementation for Naive Bayes in Spark 0.9.1 is having an implementation bug.  (was: The MLlib implementation for Naive Bayes in Spark 0.9.1 is having a implementation bug.)

> The MLlib implementation for Naive Bayes in Spark 0.9.1 is having an implementation bug.
> ----------------------------------------------------------------------------------------
>
>                 Key: SPARK-2433
>                 URL: https://issues.apache.org/jira/browse/SPARK-2433
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib, PySpark
>    Affects Versions: 0.9.1
>         Environment: Any
>            Reporter: Rahul K Bhojwani
>              Labels: easyfix, test
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> I don't have much experience reporting errors; this is my first time. If
> something is not clear, please feel free to contact me (details given below).
>
> The bug is in the pyspark mllib library.
> Path: \spark-0.9.1\python\pyspark\mllib\classification.py
> Class: NaiveBayesModel
> Method: self.predict
>
> Earlier Implementation:
> def predict(self, x):
>     """Return the most likely class for a data vector x"""
>     return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta),x)))
>
> New Implementation:
> No:1
> def predict(self, x):
>     """Return the most likely class for a data vector x"""
>     return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta),x)))
> No:2
> def predict(self, x):
>     """Return the most likely class for a data vector x"""
>     return numpy.argmax(self.pi + dot(x,self.theta.T))
>
> Explanation:
> No:1 is correct according to me. I don't know about No:2.
>
> Error one:
> The matrix self.theta has dimension [n_classes, n_features],
> while the matrix x has dimension [1, n_features].
> Taking the dot product will not work, as it is
> [1, n_features] x [n_classes, n_features].
> It gives the error "ValueError: matrices are not aligned" whenever
> n_classes differs from n_features.
> In the commented example given in classification.py, n_classes =
> n_features = 2. That's why there is no error there.
> Both Implementation No:1 and Implementation No:2 take care of it.
>
> Error 2:
> The basic implementation of naive Bayes is:
> P(class_n | sample) = count_feature_1 * P(feature_1 | class_n) * ... *
> count_feature_n * P(feature_n | class_n) * P(class_n) / (the constant P(sample)),
> taking the class with the max value.
> That's what Implementation 1 is doing.
> Implementation 2 basically takes the class with the max value of:
> (exp(count_feature_1) * P(feature_1 | class_n) * ... * exp(count_feature_n) *
> P(feature_n | class_n) * P(class_n))
> I don't know whether it gives the exact result.
>
> Thanks,
> Rahul Bhojwani
> rahulbhojwani2...@gmail.com

--
This message was sent by Atlassian JIRA
(v6.2#6252)
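
The two points in the report can be checked numerically with a small NumPy sketch. Everything here is illustrative: the toy `pi`, `theta` and `x` values and the helper names are hypothetical, and `dot(x, theta)` is assumed to be the product that "Error one" describes ([1, n_features] x [n_classes, n_features]). The reference function spells out the standard multinomial naive Bayes log-posterior (up to a constant), log P(c) + sum_i x_i * log P(feature_i | c), as an explicit loop so each candidate can be compared against it. This is a way to test the report's claims, not the fix adopted by Spark.

```python
import numpy as np

# Hypothetical toy model: 2 classes, 3 features, so theta is NOT square.
# pi    : log class priors, shape (n_classes,)
# theta : log P(feature | class), shape (n_classes, n_features)
pi = np.log(np.array([0.5, 0.5]))
theta = np.log(np.array([[0.98, 0.01, 0.01],
                         [0.34, 0.33, 0.33]]))
x = np.array([2.0, 1.0, 1.0])  # feature counts for one sample


def product_x_theta_fails(x, theta):
    """Return True if dot(x, theta) raises ValueError for these shapes."""
    try:
        np.dot(x, theta)
    except ValueError:
        return True
    return False


def scores_no1(pi, theta, x):
    """Candidate No:1 from the report: pi + log(dot(exp(theta), x))."""
    return pi + np.log(np.dot(np.exp(theta), x))


def scores_no2(pi, theta, x):
    """Candidate No:2 from the report: pi + dot(x, theta.T)."""
    return pi + np.dot(x, theta.T)


def multinomial_log_posterior(pi, theta, x):
    """Reference: log P(c) + sum_i x_i * log P(feature_i | c), per class."""
    n_classes, n_features = theta.shape
    return np.array([
        pi[c] + sum(x[i] * theta[c, i] for i in range(n_features))
        for c in range(n_classes)
    ])


if __name__ == "__main__":
    print("dot(x, theta) fails:", product_x_theta_fails(x, theta))
    print("No:1 scores:", scores_no1(pi, theta, x))
    print("No:2 scores:", scores_no2(pi, theta, x))
    print("reference  :", multinomial_log_posterior(pi, theta, x))
```

With a non-square `theta`, the product from "Error one" fails, while both candidates return one score per class. On this particular sample the two candidates choose different classes, and No:2 agrees with the explicit multinomial log-posterior, so the two are not interchangeable and "Error 2" is worth re-checking against the model definition.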