[ 
https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277631#comment-14277631
 ] 

RJ Nowling edited comment on SPARK-4894 at 1/14/15 8:50 PM:
------------------------------------------------------------

Thanks [~lmcguire]!  I'll wait until next week in case you have time to put a 
patch together.

In the meantime, here are my thoughts on the changes:
1. Add an optional `model` parameter to the `NaiveBayes` object and class and 
to `NaiveBayesModel`. It would be a string with a default value of 
`Multinomial`. For the Bernoulli variant, we can use `Bernoulli`.

2.  In `NaiveBayesModel.predict`, we should compute and store `brzPi + brzTheta 
* testData.toBreeze`. If `testData(i)` is 0, then feature `i` contributes 0 to 
`brzTheta * testData.toBreeze`. If Bernoulli is enabled, we add `log(1 - 
exp(brzTheta)) * (1 - testData.toBreeze)` to account for the probabilities of 
the 0-valued features. (Breeze may not allow adding/subtracting scalars and 
vectors/matrices, so `1 - testData.toBreeze` may need to be built explicitly.)

In the current model, no term is added for entries of `testData` that are 0. 
In the Bernoulli model, we would be adding a separate term for each 0-valued 
feature.
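
To make the difference concrete, here is a rough sketch in plain Python (not 
Breeze, to keep it dependency-free). The function names and the toy `pi`, 
`theta`, and `x` values are made up for illustration, and `theta` is assumed 
to hold `log P(feature = 1 | class)`. It is not the MLlib implementation.

```python
import math

def multinomial_score(pi, theta, x):
    """Per-class score pi[c] + sum_j theta[c][j] * x[j].

    Features with x[j] == 0 contribute nothing to the sum.
    """
    return [p + sum(t * v for t, v in zip(row, x)) for p, row in zip(pi, theta)]

def bernoulli_score(pi, theta, x):
    """Same dot product, plus log(1 - exp(theta[c][j])) for each 0-valued feature.

    Assumes every x[j] is 0 or 1 and every theta[c][j] < 0.
    """
    scores = []
    for p, row in zip(pi, theta):
        s = p
        for t, v in zip(row, x):
            # log1p(-exp(t)) == log(1 - exp(t)), the log-probability of "feature absent"
            s += t * v + math.log1p(-math.exp(t)) * (1 - v)
        scores.append(s)
    return scores

# Hypothetical model: 2 classes, 3 binary features.
pi = [math.log(0.5), math.log(0.5)]
theta = [
    [math.log(0.9), math.log(0.2), math.log(0.5)],
    [math.log(0.1), math.log(0.7), math.log(0.5)],
]
x = [1, 0, 0]

multi = multinomial_score(pi, theta, x)
bern = bernoulli_score(pi, theta, x)
```

With this toy input the two scores differ only by the terms for the two 
0-valued features, which is exactly the part the current `predict` leaves out.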

Here is the sklearn source for comparison: 
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/naive_bayes.py 
(Look at `_joint_log_likelihood` in the `MultinomialNB` and `BernoulliNB` 
classes.)

Note that sklearn adds the negative log probability, `log(1 - exp(theta))`, 
for all features and then subtracts it for the features with 1-values.
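
The two ways of writing the Bernoulli score should be algebraically 
equivalent. A short derivation, using notation assumed here rather than taken 
from either code base: `x_j` in {0, 1} is the feature value, `\pi_c` is the 
log prior for class `c`, and `\theta_{cj} = \log P(x_j = 1 \mid c)`.

```latex
\begin{aligned}
\mathrm{score}_c(x)
  &= \pi_c + \sum_j \Bigl[ x_j\,\theta_{cj}
       + (1 - x_j)\,\log\bigl(1 - e^{\theta_{cj}}\bigr) \Bigr] \\
  &= \pi_c + \sum_j \log\bigl(1 - e^{\theta_{cj}}\bigr)
       + \sum_j x_j \Bigl[ \theta_{cj} - \log\bigl(1 - e^{\theta_{cj}}\bigr) \Bigr]
\end{aligned}
```

The first line is the "add a term for 0-valued features" form; the second is 
the "add for all features, subtract for 1-valued features" form. The second 
one keeps the data-dependent part a single dot product with `x`, which may be 
convenient for sparse input.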

[~mengxr], [~lmcguire], [~josephkb] Any thoughts or comments on any of the 
above?



> Add Bernoulli-variant of Naive Bayes
> ------------------------------------
>
>                 Key: SPARK-4894
>                 URL: https://issues.apache.org/jira/browse/SPARK-4894
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>    Affects Versions: 1.2.0
>            Reporter: RJ Nowling
>            Assignee: RJ Nowling
>
> MLlib only supports the multinomial-variant of Naive Bayes.  The Bernoulli 
> version of Naive Bayes is more useful for situations where the features are 
> binary values.


