Hi there, I would like to do text clustering with k-means and Spark on a massive dataset. As far as I know, before running k-means I have to pre-process the data: build TF-IDF features and clean the text with NLTK (stop-word removal and stemming). The following is my code in Python:
if __name__ == '__main__':
    # Cluster a bunch of text documents.
    import csv
    import re
    import string
    import sys
    from collections import defaultdict

    import nltk
    from nltk.stem.porter import PorterStemmer

    k = 6
    vocab = {}      # maps each term to a column index
    xs = []         # one sparse term-frequency vector per document
    ns = []         # document texts
    cat = []        # document categories
    filename = '2013-01.csv'

    with open(filename, newline='') as f:
        try:
            newsreader = csv.reader(f)
            for row in newsreader:
                ns.append(row[3])
                cat.append(row[4])
        except csv.Error as e:
            sys.exit('file %s, line %d: %s' % (filename, newsreader.line_num, e))

    remove_spl_char_regex = re.compile('[%s]' % re.escape(string.punctuation))  # regex to remove special characters
    remove_num = re.compile(r'\d+')                                             # regex to remove numbers
    # nltk.download('stopwords')  # run once to fetch the stop-word corpus
    stop_words = nltk.corpus.stopwords.words('english')
    stemmer = PorterStemmer()

    for a in ns:
        x = defaultdict(float)

        a1 = a.strip().lower()
        a2 = remove_spl_char_regex.sub(" ", a1)  # remove special characters
        a3 = remove_num.sub("", a2)              # remove numbers
        # remove stop words, then stem
        words = a3.split()
        filter_stop_words = [w for w in words if w not in stop_words]
        stemmed = [stemmer.stem(w) for w in filter_stop_words]
        ws = sorted(stemmed)

        # ws = re.findall(r"\w+", a1)
        for w in ws:
            vocab.setdefault(w, len(vocab))
            x[vocab[w]] += 1
        xs.append(x.items())

Can anyone explain how I can do this pre-processing step on Spark before running k-means?

Best Regards,

Amin Mohebbi
PhD candidate in Software Engineering at University of Malaysia
Tel : +60 18 2040 017
E-Mail : tp025...@ex.apiit.edu.my
amin_...@me.com
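P.S. To make the question more concrete, below is a rough, untested sketch of the direction I am imagining for the Spark side, using PySpark with MLlib's HashingTF/IDF and KMeans. These calls, the naive comma-split CSV parsing, and all parameter values are just my assumptions, not something I have working; it also assumes NLTK and its stop-word corpus are installed on every worker node.

import re
import string

import nltk
from nltk.stem.porter import PorterStemmer
from pyspark import SparkContext
from pyspark.mllib.feature import HashingTF, IDF
from pyspark.mllib.clustering import KMeans

punct_re = re.compile('[%s]' % re.escape(string.punctuation))
num_re = re.compile(r'\d+')
stop_words = set(nltk.corpus.stopwords.words('english'))

def tokenize(text):
    # Same cleaning steps as the local script: lowercase, drop punctuation
    # and numbers, remove stop words, stem.
    text = num_re.sub("", punct_re.sub(" ", text.strip().lower()))
    stemmer = PorterStemmer()
    return [stemmer.stem(w) for w in text.split() if w not in stop_words]

sc = SparkContext(appName="TextClustering")

# Column 3 of each CSV row holds the news text (naive comma split, ignores quoting).
docs = sc.textFile("2013-01.csv").map(lambda line: tokenize(line.split(",")[3]))

tf = HashingTF().transform(docs)      # hashed term-frequency vectors
tf.cache()
tfidf = IDF().fit(tf).transform(tf)   # re-weight by inverse document frequency

# Cluster the TF-IDF vectors; k = 6 as in the local script, 10 iterations is arbitrary.
model = KMeans.train(tfidf, k=6, maxIterations=10)

Does this look like a reasonable way to structure it, or is there a better approach?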