At a high level I think you have these choices (and more): 1) Hadoop Streaming: lets you reuse some of your Python code, but not all of it, because you have to restructure the work as map/reduce steps. 2) Use Mahout. 3) Use a distribution of R that works with Hadoop ..
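For option 1, here is a minimal sketch of a Hadoop Streaming mapper/reducer pair that computes the TF part (per-document term counts) of TF-IDF. It assumes the input is tab-separated (label, id, tweet text) and uses a tiny placeholder stop-word list; with NLTK available you would swap in `nltk.corpus.stopwords` and a stemmer such as `nltk.stem.PorterStemmer`. Computing the IDF denominator (document frequencies) would take a second MapReduce pass, and the classifier step is separate again.

```python
#!/usr/bin/env python
# Hadoop Streaming sketch: term-frequency counts per tweet.
# Assumes tab-separated input lines: label<TAB>doc_id<TAB>tweet text
import re
import sys
from itertools import groupby

# Placeholder list; replace with nltk.corpus.stopwords.words("english")
STOP_WORDS = {"a", "an", "and", "is", "of", "on", "the", "to"}

def map_line(line):
    """Yield 'term<TAB>doc_id<TAB>1' for each non-stop-word token."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 3:
        return
    label, doc_id, text = parts
    for token in re.findall(r"[a-z']+", text.lower()):
        if token not in STOP_WORDS:
            # A stemmer (e.g. PorterStemmer().stem(token)) would go here.
            yield "%s\t%s\t1" % (token, doc_id)

def reduce_counts(sorted_lines):
    """Sum counts for each (term, doc_id) key.

    Streaming delivers mapper output sorted by key, so equal keys are
    adjacent and can be summed with groupby.
    """
    def key(line):
        term, doc_id, _ = line.rstrip("\n").split("\t")
        return (term, doc_id)
    for (term, doc_id), group in groupby(sorted_lines, key=key):
        total = sum(int(l.rstrip("\n").split("\t")[2]) for l in group)
        yield "%s\t%s\t%d" % (term, doc_id, total)

if __name__ == "__main__":
    # Run the same script as either -mapper or -reducer.
    step = sys.argv[1] if len(sys.argv) > 1 else "map"
    if step == "map":
        for line in sys.stdin:
            for out in map_line(line):
                print(out)
    else:
        for out in reduce_counts(sys.stdin):
            print(out)
```

You would then submit it with something like `hadoop jar $HADOOP_HOME/.../hadoop-streaming*.jar -input tweets -output tf -mapper 'mapper.py map' -reducer 'mapper.py reduce' -file mapper.py` (exact jar path depends on your distro). Note that stop-word removal here is a stand-in for the full NLTK pipeline the original poster already has.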
On Thu, Apr 24, 2014 at 1:58 PM, qiaoresearcher <qiaoresearc...@gmail.com> wrote:
> I have Hadoop and Python installed with NLTK. Now I have a large input
> file which has three columns:
>
> column 1 | column 2 | column 3
> positive | id1 | some tweet message
> negative | id2 | other tweet message
> positive | id3 | tweet message
> negative | id4 | tweet message
> positive | id5 | tweet message
> .... | ... | ....
>
> I want to use text mining to construct TF-IDF vectors from the tweet
> messages (also using stop words, stemming, etc.) and then use some
> classifier to classify each tweet message as positive or negative. I know
> how to do this using just Python and NLTK. But how do I do the same thing
> on Hadoop?
>
> thanks!