[ 
https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277770#comment-14277770
 ] 

RJ Nowling commented on SPARK-4894:
-----------------------------------

Hi [~josephkb], lots to think about!

In general, I'm a big fan of multiple small changes over time rather than one 
big change.  They're easier to verify and review.  Since MLlib is going through 
an interface refactoring to become ML anyway, we can focus on the Bernoulli NB 
change now and worry about a redesign of the API later.

What do you have in mind for other feature and label types?  I briefly reviewed 
Factorie -- their concept of Factors may be overcomplicated for Naive Bayes 
but I want to learn more about your ideas.  Do you have a few concrete examples 
of how Factors could be used with NB?  And for continuous labels, are you 
thinking of something like the Gaussian NB in sklearn?

From bioinformatics, I know that folks tend to encode categorical variables 
incorrectly.  E.g., for a DNA sequence consisting of A, T, C, G, and possibly 
gaps, each position in a sequence should be encoded as four (five) features, 
one for each nucleotide.  When folks try to represent each position as one 
feature with the bases as numbers (A=1, T=2, etc.), this results in incorrect 
distance metrics. E.g., ATT will differ from TTT by 1 but ATT will differ from 
CTT by 2. By using one feature for each of the four (five) possibilities, you 
get correct distances and can even weight mutations and deletions using BLOSUM 
matrices and such.  For this type of case, I think the solution there is 
education and documentation, not complicated type systems.




> Add Bernoulli-variant of Naive Bayes
> ------------------------------------
>
>                 Key: SPARK-4894
>                 URL: https://issues.apache.org/jira/browse/SPARK-4894
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>    Affects Versions: 1.2.0
>            Reporter: RJ Nowling
>            Assignee: RJ Nowling
>
> MLlib only supports the multinomial-variant of Naive Bayes.  The Bernoulli 
> version of Naive Bayes is more useful for situations where the features are 
> binary values.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

