For an example of classifying Twitter messages with Mahout's naive Bayes classifier, see http://chimpler.wordpress.com/2013/03/13/using-the-mahout-naive-bayes-classifier-to-automatically-classify-twitter-messages/
Lucene supports n-grams, stop words, the Porter and Snowball stemmers, language-specific analyzers, and more. Mahout uses Lucene for vectorization, as part of its seq2sparse process. See http://mahout.apache.org/users/basics/creating-vectors-from-text.html

On Thursday, January 16, 2014 10:57 PM, qiaoresearcher <qiaoresearc...@gmail.com> wrote:

Mahout has an example of using naive Bayes to classify the 20 newsgroups data set. But how can I classify plain paragraphs (e.g. Twitter messages, movie reviews) stored in text files?

The text files have content like:

----------------------------------------------------------
text paragraph 1    class a
text paragraph 2    class b
text paragraph 3    class a
text paragraph 4    class b
.............
...

Does it support n-grams, stemming, stop words, etc.?

Thanks for any suggestions.
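
For the file layout shown in the question, one possible preparation step is to turn each line into a (key, text) pair before vectorization. The sketch below is only an illustration under several assumptions: that the paragraph and its class are separated by a tab (the separator is not visible in the question), that the class is the last field on the line, and that keys of the form /<label>/<id> are what the naive Bayes trainer can read labels from (worth checking against your Mahout version). The function name parse_labeled_lines is made up for this example.

```python
import io


def parse_labeled_lines(lines, sep="\t"):
    """Yield (key, text) pairs from lines shaped like '<text><sep><label>'.

    The '/<label>/<id>' key layout is an assumption, not a confirmed
    Mahout requirement; adjust it to whatever your trainer expects.
    """
    for doc_id, raw in enumerate(lines):
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        # Split on the LAST separator so separators inside the text are kept.
        text, found, label = line.rpartition(sep)
        if not found:
            raise ValueError(f"line {doc_id}: separator not found")
        label = label.strip()
        if "/" in label:
            raise ValueError(f"line {doc_id}: label must not contain '/'")
        yield f"/{label}/{doc_id}", text.strip()


# Tiny in-memory stand-in for the text file described in the question.
sample = io.StringIO(
    "text paragraph 1\tclass a\n"
    "text paragraph 2\tclass b\n"
)
records = list(parse_labeled_lines(sample))
for key, text in records:
    print(key, "->", text)
```

This only covers parsing. The pairs would still have to be written to a Hadoop sequence file before seq2sparse can read them, which this sketch does not attempt.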