See
http://chimpler.wordpress.com/2013/03/13/using-the-mahout-naive-bayes-classifier-to-automatically-classify-twitter-messages/
for an example of classifying Twitter messages with Mahout's naive Bayes classifier.
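
If you would rather reuse the 20 newsgroups workflow than follow the post above, one option is to split your file into one directory per class, which is roughly the layout that example starts from. Below is a minimal Python sketch of that split. It assumes each line is "text<TAB>label" (the sample in your message does not show the separator, so adjust as needed); split_by_class and the output directory are hypothetical names, not part of Mahout.

```python
import os
import tempfile


def split_by_class(lines, out_dir):
    """Write each 'text<TAB>label' line to out_dir/<label>/<line number>.txt.

    Assumes the label is the last tab-separated field on the line.
    Returns a dict mapping each label to the number of documents written.
    """
    counts = {}
    for n, line in enumerate(lines):
        line = line.rstrip("\n")
        if "\t" not in line:
            continue  # skip blank or malformed lines
        text, _, label = line.rpartition("\t")
        # Use the label as a directory name; spaces become underscores.
        label = label.strip().replace(" ", "_")
        class_dir = os.path.join(out_dir, label)
        os.makedirs(class_dir, exist_ok=True)
        path = os.path.join(class_dir, "%d.txt" % n)
        with open(path, "w", encoding="utf-8") as f:
            f.write(text.strip() + "\n")
        counts[label] = counts.get(label, 0) + 1
    return counts


# Sample input mirroring the layout in the question (tab-separated by assumption).
sample = [
    "text paragraph 1\tclass a\n",
    "text paragraph 2\tclass b\n",
    "text paragraph 3\tclass a\n",
    "text paragraph 4\tclass b\n",
]
out_dir = tempfile.mkdtemp()
counts = split_by_class(sample, out_dir)
print(counts)
```

The resulting directory (one subdirectory per class, one small file per paragraph) can then be fed to the same steps the 20 newsgroups example uses, starting with seqdirectory.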

Lucene supports n-grams, stop words, the Porter and Snowball stemmers,
language-specific analyzers, and so on.
Mahout uses Lucene's analyzers when vectorizing text (part of Mahout's
seq2sparse process).
See http://mahout.apache.org/users/basics/creating-vectors-from-text.html
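
To make concrete what stop-word removal and n-grams do to a message before it becomes a vector, here is a small plain-Python illustration. It is not Lucene's or Mahout's implementation: the stop-word list and the analyze function are made up for this example, stemming is left out, and real analyzers tokenize more carefully. In seq2sparse the analyzer and the maximum n-gram size are command-line options; the page above and "mahout seq2sparse --help" list them.

```python
# Tiny illustrative stop-word list; Lucene ships its own, larger lists.
STOP_WORDS = {"the", "a", "is", "of", "to"}


def analyze(text, max_ngram=2):
    """Lowercase, split on whitespace, drop stop words, then add n-grams.

    Returns the unigrams followed by all n-grams up to max_ngram words,
    built from the tokens that remain after stop-word removal.
    """
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    terms = list(tokens)
    for n in range(2, max_ngram + 1):
        for i in range(len(tokens) - n + 1):
            terms.append(" ".join(tokens[i:i + n]))
    return terms


terms = analyze("The movie is a waste of time")
print(terms)
# ['movie', 'waste', 'time', 'movie waste', 'waste time']
```

Each resulting term becomes one dimension of the document vector, so enabling bigrams lets the classifier see phrases such as "waste time" rather than only the individual words.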

On Thursday, January 16, 2014 10:57 PM, qiaoresearcher 
<qiaoresearc...@gmail.com> wrote:
 
Mahout has an example of using naive Bayes to classify the 20 newsgroups
data set. But how can I classify individual paragraphs (e.g. Twitter
messages, movie reviews) stored in text files?

The text files have content like:
----------------------------------------------------------
text paragraph 1                     class a
text paragraph 2                     class b
text paragraph 3                     class a
text paragraph 4                     class b
.............                                      ...

Does it support n-grams, stemming, stop words, etc.?

Thanks for any suggestions.
