An example of a Naive Bayes classifier trained on character n-grams is the LangDetect library (see http://code.google.com/p/language-detection/).
Agree with Ted that it should be relatively easy to build one.

On Wednesday, October 9, 2013 6:40 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:

Yes. Should work to use character n-grams. There are oddities in the stats because the different n-grams are not independent, but Naive Bayes methods are in such a state of sin that it shouldn't hurt any worse.

No... I don't think that there is a capability built in to generate the character n-grams. Should be relatively trivial to build.

On Wed, Oct 9, 2013 at 3:18 AM, Dean Jones <dean.m.jo...@gmail.com> wrote:

> Hello folks,
>
> I see that it's possible to use mahout to train a naive bayes
> classifier using n-grams as features (or I guess, strictly speaking,
> mahout can be used to generate sequence files containing n-grams; I
> suspect the naive bayes trainer is indifferent to the form of features
> it trains on). Is there any facility to generate character n-grams
> instead of word n-grams?
>
> Thanks,
>
> Dean.
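
For what it's worth, here is a rough sketch of the character n-gram step discussed in this thread. It is plain Python and does not use any Mahout API; the function names `char_ngrams` and `to_token_line`, the `_` padding character, and the lowercasing are all my own assumptions for illustration, not how LangDetect or Mahout do it. The idea is to write each document as a space-separated line of n-grams, on the assumption that a downstream whitespace tokenizer would then treat every n-gram as an ordinary term.

```python
def char_ngrams(text, n=3, pad="_"):
    """Return the overlapping character n-grams of `text`, in order.

    Assumptions (not taken from Mahout or LangDetect): text is lowercased,
    whitespace runs are collapsed to a single `pad` character, and one `pad`
    is added at each end so that word boundaries show up as features.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # Use `pad` instead of a space so the n-grams contain no whitespace.
    normalized = pad.join(text.lower().split())
    padded = pad + normalized + pad
    # Slide a window of width n over the padded string.
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def to_token_line(text, n=3):
    """Join the n-grams with spaces so each one looks like a 'word'."""
    return " ".join(char_ngrams(text, n))
```

With the defaults, `char_ngrams("Hi there")` gives `_hi`, `hi_`, `i_t`, `_th`, `the`, `her`, `ere`, `re_`. As Ted notes, overlapping n-grams like these are far from independent, so the resulting Naive Bayes probabilities should be read as scores rather than calibrated estimates.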