[ https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277770#comment-14277770 ]
RJ Nowling commented on SPARK-4894:
-----------------------------------

Hi [~josephkb], lots to think about!

In general, I'm a big fan of multiple small changes over time rather than one big change. They're easier to verify and review. Since MLlib is going through an interface refactoring to become ML anyway, we can focus on the Bernoulli NB change now and worry about a redesign of the API later.

What do you have in mind for other feature and label types? I briefly reviewed Factorie -- their concept of Factors may be overcomplicated for Naive Bayes, but I want to learn more about your ideas. Do you have a few concrete examples of how Factors could be used with NB? And for continuous labels, are you thinking of something like the Gaussian NB in sklearn?

From bioinformatics, I know that folks tend to encode categorical variables incorrectly. E.g., for a DNA sequence consisting of A, T, C, G, and possibly gaps, each position in a sequence should be encoded as four (five) features, one for each nucleotide. When folks try to represent each position as one feature with the bases as numbers (A=1, T=2, etc.), this results in incorrect distance metrics. E.g., ATT will differ from TTT by 1 but ATT will differ from CTT by 2. By using one feature for each of the four (five) possibilities, you get correct distances and can even weight mutations and deletions using BLOSUM matrices and such. For this type of case, I think the solution is education and documentation, not complicated type systems.

> Add Bernoulli-variant of Naive Bayes
> ------------------------------------
>
>                 Key: SPARK-4894
>                 URL: https://issues.apache.org/jira/browse/SPARK-4894
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>    Affects Versions: 1.2.0
>            Reporter: RJ Nowling
>            Assignee: RJ Nowling
>
> MLlib only supports the multinomial-variant of Naive Bayes. The Bernoulli
> version of Naive Bayes is more useful for situations where the features are
> binary values.
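
The encoding point in the comment (ATT vs. TTT vs. CTT) can be made concrete with a small Python sketch. This is an illustration only, not code from MLlib or from the issue. It assumes the integer codes continue in the order A=1, T=2, C=3, G=4, which is consistent with the distances quoted in the comment, and it uses a plain L1 distance. The function names are hypothetical.

```python
# Illustrative sketch: compare an integer ("ordinal") encoding of DNA bases
# with a one-feature-per-nucleotide (one-hot) encoding.
# Assumption: codes follow the order A=1, T=2, C=3, G=4 (gaps are ignored here).
ALPHABET = "ATCG"


def ordinal_encode(seq):
    """One feature per position, with bases mapped to 1..4."""
    return [ALPHABET.index(base) + 1 for base in seq]


def one_hot_encode(seq):
    """Four binary features per position, one for each nucleotide."""
    features = []
    for base in seq:
        features.extend(1 if base == symbol else 0 for symbol in ALPHABET)
    return features


def l1_distance(u, v):
    """Sum of absolute differences between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


for other in ("TTT", "CTT"):
    d_ordinal = l1_distance(ordinal_encode("ATT"), ordinal_encode(other))
    d_one_hot = l1_distance(one_hot_encode("ATT"), one_hot_encode(other))
    print(f"ATT vs {other}: ordinal={d_ordinal}, one-hot={d_one_hot}")
```

With the integer codes, the two single-base substitutions get different distances purely because of the arbitrary numbering. With one feature per nucleotide, every single-base substitution is the same distance apart, and the resulting features are binary, which is the kind of input the Bernoulli variant described in this issue targets.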