Mahout has an example of using naive Bayes to classify the 20 Newsgroups
data set. But how can I classify just paragraphs (e.g. Twitter messages,
movie reviews) stored in text files?
The text files have content like:
--
text paragraph 1 class
See
http://chimpler.wordpress.com/2013/03/13/using-the-mahout-naive-bayes-classifier-to-automatically-classify-twitter-messages/
for an example of classifying Twitter messages.
Lucene has support for n-grams, stop words, the Porter stemmer, Snowball
stemmers, language-specific analyzers, etc.
Mahout uses Lucene.
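
To make the analysis steps above concrete, here is a minimal sketch in plain Java of what stop-word removal and word n-grams do to a paragraph. It deliberately does not use Lucene's API; the class name AnalysisSketch, the tiny stop-word list, and the helper methods are all hypothetical and for illustration only. In practice a Lucene analyzer would do this work, plus stemming.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class AnalysisSketch {
    // Hypothetical, tiny stop-word list (an assumption, not Lucene's list).
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "of", "to", "and"));

    // Lowercase the text, split on anything that is not a letter or digit,
    // and drop empty strings and stop words.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Build word n-grams (n consecutive tokens joined by a space).
    static List<String> ngrams(List<String> tokens, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            out.add(String.join(" ", tokens.subList(i, i + n)));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("The movie is a great piece of work");
        System.out.println(tokens);            // [movie, great, piece, work]
        System.out.println(ngrams(tokens, 2)); // [movie great, great piece, piece work]
    }
}
```

The unigrams and bigrams produced this way are the kind of terms that would then be counted as features for the classifier.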
Suneel, thanks a lot.
I assume the example you mentioned generates a numerical vector for
each paragraph; is that right?
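
As a rough illustration of what "a numerical vector for each paragraph" can mean, here is a minimal term-count sketch in plain Java. It is an assumption-laden simplification, not Mahout's implementation: the class VectorSketch and its methods are hypothetical, the vectors are dense arrays, and the weights are raw counts rather than TF-IDF.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class VectorSketch {
    // Assign each distinct term an index, in first-seen order.
    static Map<String, Integer> buildDictionary(List<List<String>> docs) {
        Map<String, Integer> dict = new LinkedHashMap<>();
        for (List<String> doc : docs) {
            for (String term : doc) {
                dict.putIfAbsent(term, dict.size());
            }
        }
        return dict;
    }

    // Count how often each dictionary term occurs in one tokenized paragraph.
    // Terms missing from the dictionary are ignored.
    static double[] toVector(List<String> doc, Map<String, Integer> dict) {
        double[] v = new double[dict.size()];
        for (String term : doc) {
            Integer idx = dict.get(term);
            if (idx != null) {
                v[idx] += 1.0;
            }
        }
        return v;
    }

    public static void main(String[] args) {
        // Two already-tokenized paragraphs (made-up example data).
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("great", "movie"),
                Arrays.asList("bad", "movie", "bad"));
        Map<String, Integer> dict = buildDictionary(docs); // great=0, movie=1, bad=2
        System.out.println(Arrays.toString(toVector(docs.get(1), dict))); // [0.0, 1.0, 2.0]
    }
}
```

Each paragraph ends up as one vector whose length is the dictionary size, with one position per term.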
Now, to further improve performance, I may add features from other
data sets to this vector, making it much longer, and then use the
enriched vector for naive