[ https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277631#comment-14277631 ]
RJ Nowling commented on SPARK-4894:
-----------------------------------

Thanks [~lmcguire]! I'll wait until next week in case you have time to put a patch together.

In the meantime, here are my thoughts on the changes:

1. Add an optional `model` variable to the `NaiveBayes` object and class and to `NaiveBayesModel`. It would be a string with a default value of `Multinomial`. For Bernoulli, we can use `Bernoulli`.

2. In `NaiveBayesModel.predict`, we should compute and store `brzPi + brzTheta * testData.toBreeze`. If `testData(i)` is 0, then feature `i` contributes 0 to `brzTheta * testData.toBreeze`. If Bernoulli is enabled, we add `log(1 - exp(brzTheta)) * (1 - testData.toBreeze)` to account for the probabilities of the 0-valued features. (Breeze may not allow adding/subtracting scalars and vectors/matrices.)

In the current model, no term is added for entries of `testData` that are 0. In the Bernoulli model, we would be adding a separate term for 0-valued features.

Here is the sklearn source for comparison:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/naive_bayes.py

Note that sklearn adds the negative-probability term for every feature and then subtracts it for features with value 1.

[~mengxr], [~josephkb] Any thoughts or comments?

> Add Bernoulli-variant of Naive Bayes
> ------------------------------------
>
>                 Key: SPARK-4894
>                 URL: https://issues.apache.org/jira/browse/SPARK-4894
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>    Affects Versions: 1.2.0
>            Reporter: RJ Nowling
>            Assignee: RJ Nowling
>
> MLlib only supports the multinomial-variant of Naive Bayes. The Bernoulli
> version of Naive Bayes is more useful for situations where the features are
> binary values.
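
For illustration only, the Bernoulli scoring proposed in item 2 of the comment could be sketched as follows. This is plain Python rather than MLlib or Breeze code, and it is not the sklearn implementation. The function names and the list-based inputs are hypothetical: `log_pi` stands in for `brzPi`, `log_theta` for `brzTheta` (one row per class), and `x` for a binary `testData` vector. It assumes smoothing keeps every feature probability strictly between 0 and 1, so that `log(1 - exp(log_theta))` is defined. The second function shows the rearrangement the comment attributes to sklearn: add the negative term for every feature, then correct the 1-valued features.

```python
import math


def bernoulli_log_scores(log_pi, log_theta, x):
    """Unnormalized per-class log scores for a binary feature vector x.

    log_pi[c]       -- log prior of class c (hypothetical stand-in for brzPi)
    log_theta[c][j] -- log P(feature j = 1 | class c) (stand-in for brzTheta)
    Assumes 0 < exp(log_theta[c][j]) < 1, e.g. thanks to smoothing.
    """
    scores = []
    for c, row in enumerate(log_theta):
        score = log_pi[c]
        for j, lt in enumerate(row):
            log_neg = math.log1p(-math.exp(lt))  # log(1 - theta)
            # 1-valued features use log(theta); 0-valued features use log(1 - theta)
            score += x[j] * lt + (1 - x[j]) * log_neg
        scores.append(score)
    return scores


def bernoulli_log_scores_rearranged(log_pi, log_theta, x):
    """Same scores, arranged as: negative term for all features,
    then (log_theta - log_neg) added back for the 1-valued features."""
    scores = []
    for c, row in enumerate(log_theta):
        log_neg = [math.log1p(-math.exp(lt)) for lt in row]
        score = log_pi[c] + sum(log_neg)
        score += sum(x[j] * (row[j] - log_neg[j]) for j in range(len(row)))
        scores.append(score)
    return scores


if __name__ == "__main__":
    # Made-up toy model: two classes, two binary features.
    log_pi = [math.log(0.5), math.log(0.5)]
    log_theta = [[math.log(0.5), math.log(0.5)],
                 [math.log(0.9), math.log(0.1)]]
    print(bernoulli_log_scores(log_pi, log_theta, [1, 0]))
    print(bernoulli_log_scores_rearranged(log_pi, log_theta, [1, 0]))
```

Both arrangements give the same scores. The rearranged form only needs one dense sum of the negative terms per class plus a product with the feature vector, which may matter if adding scalars to Breeze vectors turns out to be awkward.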