Guys,

As to the question of pre-processing, you could just migrate your logic to Spark before running K-means.
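
For illustration, here is a rough sketch of what that migration might look like in Python. It goes by the pyspark.mllib documentation rather than hands-on use, so treat it as a starting point under assumptions, not a tested pipeline. The names tokenize, cluster_documents and STOP_WORDS are made up for this example, the tiny stop-word set is a stand-in for the NLTK list, stemming is left out, and HashingTF replaces the hand-built vocab dictionary so that no global state has to be shared across workers.

```python
import re
import string

# Assumption: a tiny stand-in stop-word set so the sketch runs without NLTK data.
# In practice you would load nltk.corpus.stopwords.words('english') instead.
STOP_WORDS = frozenset(["a", "an", "and", "the", "of", "to", "in", "is", "on"])

PUNCT_RE = re.compile("[%s]" % re.escape(string.punctuation))
NUM_RE = re.compile(r"\d+")


def tokenize(text, stop_words=STOP_WORDS):
    """Clean one document: lowercase, strip punctuation and digits, drop stop words."""
    text = PUNCT_RE.sub(" ", text.strip().lower())
    text = NUM_RE.sub("", text)
    return [w for w in text.split() if w not in stop_words]


def cluster_documents(sc, docs, k=6, num_features=1 << 18):
    """Hypothetical helper: TF-IDF plus K-means on an iterable of raw strings.

    sc is an existing SparkContext. The pyspark imports are kept local so that
    tokenize() can be tried out on a machine without Spark installed.
    """
    from pyspark.mllib.clustering import KMeans
    from pyspark.mllib.feature import HashingTF, IDF

    # The per-document cleaning runs on the workers instead of in a local loop.
    tokens = sc.parallelize(docs).map(tokenize)

    # Hash each term to a column index, giving sparse term-frequency vectors.
    tf = HashingTF(numFeatures=num_features).transform(tokens)
    tf.cache()  # tf is used twice: once to fit the IDF model, once to transform

    tfidf = IDF().fit(tf).transform(tf)
    model = KMeans.train(tfidf, k, maxIterations=20)
    return model, model.predict(tfidf)
```

The cleaning function is plain Python, so it can be checked locally before being handed to Spark; for example, tokenize("The 3 cats, and 2 dogs!") gives ['cats', 'dogs'].
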
I have only used Scala on Spark, not the Python bindings, but I think the basic steps must be the same.

BTW, if your data set is big and has a huge, sparse feature vector, K-Means may not work as well as you expect. I think this is still an area where Spark MLlib is being optimized.

On Wed, Nov 19, 2014 at 2:21 PM, amin mohebbi <aminn_...@yahoo.com.invalid> wrote:

> Hi there,
>
> I would like to do "text clustering" using k-means and Spark on a massive
> dataset. As you know, before running k-means, I have to apply
> pre-processing steps such as TF-IDF and NLTK to my big dataset. The
> following is my code in Python:
>
> import csv
> import re
> import string
> import sys
> from collections import defaultdict
>
> import nltk
> from nltk.stem.porter import PorterStemmer
>
> if __name__ == '__main__':
>     # Cluster a bunch of text documents.
>     k = 6
>     vocab = {}
>     xs = []
>     ns = []
>     cat = []
>     filename = '2013-01.csv'
>     with open(filename, newline='') as f:
>         try:
>             newsreader = csv.reader(f)
>             for row in newsreader:
>                 ns.append(row[3])
>                 cat.append(row[4])
>         except csv.Error as e:
>             sys.exit('file %s, line %d: %s' % (filename, newsreader.line_num, e))
>
>     # regex to remove special characters
>     remove_spl_char_regex = re.compile('[%s]' % re.escape(string.punctuation))
>     remove_num = re.compile(r'[\d]+')
>     # nltk.download()
>     stop_words = nltk.corpus.stopwords.words('english')
>     stemmer = PorterStemmer()
>
>     for a in ns:
>         x = defaultdict(float)
>         a1 = a.strip().lower()
>         a2 = remove_spl_char_regex.sub(" ", a1)  # Remove special characters
>         a3 = remove_num.sub("", a2)  # Remove numbers
>         # Remove stop words
>         words = a3.split()
>         filter_stop_words = [w for w in words if w not in stop_words]
>         stemed = [stemmer.stem(w) for w in filter_stop_words]
>         ws = sorted(stemed)
>         # ws = re.findall(r"\w+", a1)
>         for w in ws:
>             vocab.setdefault(w, len(vocab))
>             x[vocab[w]] += 1
>         xs.append(list(x.items()))
>
> Can anyone explain to me how I can do the pre-processing step before
> running k-means using Spark?
>
> Best Regards
>
> .......................................................
>
> Amin Mohebbi
>
> PhD candidate in Software Engineering
> at university of Malaysia
>
> Tel : +60 18 2040 017
>
> E-Mail : tp025...@ex.apiit.edu.my
>          amin_...@me.com

--
yangjun...@gmail.com
http://hi.baidu.com/yjpro