Hi
I was looking at the naive Bayes classifier's implementation because I
was surprised to see an n-gram parameter being used.
My understanding of 'traditional' naive Bayes is that it only
considers probabilities related to single words/tokens, independent of
context. Is that not what the Mahout implementation does? Are the N-
grams used to model N-sequences of tokens as additional "words" for
the algorithm to handle? Or are they used as input in some other way?
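To make my question concrete, here is a tiny sketch of what I imagine
"N-grams as words" would mean - consecutive token sequences joined
into composite features and then counted by naive Bayes exactly like
ordinary words. This is my own illustration, not Mahout code; the
class name, the '_' joining convention, and the n parameter are just
placeholders:

import java.util.*;

// Sketch only: my guess at "token N-grams treated as words".
// Not taken from the Mahout sources.
public class NGramFeatures {
    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("the", "quick", "brown", "fox");
        int n = 2; // hypothetical n-gram size parameter

        // Start from the unigrams, then add each run of n consecutive
        // tokens as one composite "word".
        List<String> features = new ArrayList<>(tokens);
        for (int i = 0; i + n <= tokens.size(); i++) {
            features.add(String.join("_", tokens.subList(i, i + n)));
        }

        // Naive Bayes would then estimate P(feature | class) for each
        // entry independently, exactly as it does for single words.
        System.out.println(features);
        // -> [the, quick, brown, fox, the_quick, quick_brown, brown_fox]
    }
}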
From what I gather from NGrams.java, it seems to use "N-grams" of N
tokens, not N characters. Or are they not related to token sequences
but to character sequences somehow?
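For comparison, this is how I picture the two readings - N-grams over
the token list versus N-grams over the characters of a single string.
Again just an illustrative sketch under my own assumptions, not based
on NGrams.java itself:

import java.util.*;

// Sketch only: token N-grams vs. character N-grams for n = 2.
public class TokenVsCharNGrams {
    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("naive", "bayes", "classifier");
        String word = "naive";
        int n = 2;

        // Token N-grams: windows of n consecutive tokens.
        List<String> tokenGrams = new ArrayList<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            tokenGrams.add(String.join(" ", tokens.subList(i, i + n)));
        }
        System.out.println(tokenGrams); // [naive bayes, bayes classifier]

        // Character N-grams: windows of n consecutive characters.
        List<String> charGrams = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++) {
            charGrams.add(word.substring(i, i + n));
        }
        System.out.println(charGrams); // [na, ai, iv, ve]
    }
}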
Any help or pointers to materials the implementation is based on would
be appreciated. (I know that the Complementary Naive Bayes
implementation is quite different and based on a paper introducing
that method - but I'm wondering about the 'normal' Naive Bayes
implementation.)
Regards,
Loek