mahout text mining

2014-01-16 Thread qiaoresearcher
Mahout has an example that uses naive Bayes to classify the 20 Newsgroups data set. But how do I classify individual paragraphs (e.g. Twitter messages, movie reviews) stored in text files? The text files have content like: -- text paragraph 1 class
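To make the question concrete, here is a minimal sketch in plain Python (not Mahout) of classifying short labeled paragraphs with a multinomial naive Bayes model and add-one smoothing. The exact file layout is cut off in the message above, so the "text, then a tab, then the class label" format used here is only an assumption, as are the function names and the sample data.

```python
import math
from collections import Counter, defaultdict

def parse_labeled_lines(lines):
    """Split each non-empty line into (text, label) on the last tab.

    The tab-separated layout is an assumption, not the poster's format.
    """
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        text, _, label = line.rpartition("\t")
        pairs.append((text.strip(), label.strip()))
    return pairs

def train(pairs):
    """Collect per-class document counts and word counts."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in pairs:
        class_docs[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_docs, word_counts, vocab

def classify(text, model):
    """Return the label with the highest log posterior (add-one smoothing)."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        score = math.log(class_docs[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical sample data: one paragraph per line, label after the tab.
sample = [
    "great movie loved it\tpositive",
    "wonderful acting great story\tpositive",
    "terrible plot boring movie\tnegative",
    "boring and terrible acting\tnegative",
]
model = train(parse_labeled_lines(sample))
print(classify("great story", model))
print(classify("boring plot", model))
```

With real data, each line of the input file would go through parse_labeled_lines in the same way, after adjusting the splitting rule to the actual layout.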

Re: mahout text mining

2014-01-16 Thread Suneel Marthi
See http://chimpler.wordpress.com/2013/03/13/using-the-mahout-naive-bayes-classifier-to-automatically-classify-twitter-messages/ for classifying Twitter messages. Lucene has support for n-grams, stopwords, the Porter stemmer, the Snowball stemmer, language-specific analyzers, etc. Mahout uses Lucene
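As a rough illustration of two of the preprocessing steps named above (stopword removal and n-grams), here is a small sketch in plain Python. It does not use the Lucene API; the stopword list, the regular expression, and the function names are all assumptions chosen for the example, and stemming is left out.

```python
import re

# Tiny illustrative stopword list (an assumption, not Lucene's list).
STOPWORDS = {"a", "an", "the", "is", "of", "to", "and"}

def tokenize(text):
    """Lowercase the text and return runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]

def ngrams(tokens, n):
    """Return all contiguous n-token sequences, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = remove_stopwords(tokenize("The plot of the movie is great"))
print(tokens)
print(ngrams(tokens, 2))
```

The unigrams and bigrams produced this way could both serve as terms when building the feature vector for a message.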

Re: mahout text mining

2014-01-16 Thread qiaoresearcher
Suneel, thanks a lot. I assume the example you mentioned generates a numerical vector for each paragraph, is that right? Now, to further improve performance, I may add features from other data sets to this vector, making it much longer, and then use the enriched vector for naive
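The idea of a per-paragraph vector that is later extended with extra features might look like the following sketch in plain Python (not Mahout's vector classes). The vocabulary, the extra feature values, and the function names are hypothetical. One caveat: a multinomial naive Bayes model treats each entry as a count-like, non-negative value, so appended features would probably need to fit that form.

```python
from collections import Counter

def term_count_vector(text, vocabulary):
    """Map a paragraph to term counts, one slot per vocabulary entry."""
    counts = Counter(text.lower().split())
    return [float(counts[term]) for term in vocabulary]

def enrich(vector, extra_features):
    """Append extra (non-negative) features to the end of the vector."""
    return vector + [float(x) for x in extra_features]

# Hypothetical fixed vocabulary; real ones come from the training corpus.
vocabulary = ["boring", "great", "movie"]
base = term_count_vector("great great movie", vocabulary)

# Hypothetical extra features from another data set, e.g. two indicators.
enriched = enrich(base, [1, 0])
print(base)
print(enriched)
```

Every paragraph has to get the same extra features in the same order, so that each position in the longer vector keeps a single meaning across training and classification.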