Hi there,
I would like to do text clustering with k-means and Spark on a massive
dataset. As far as I know, before running k-means I have to apply
pre-processing steps such as TF-IDF weighting and NLTK-based cleaning
(tokenization, stop-word removal, stemming) to my big dataset. The following
is my current (local, non-Spark) Python code:

import csv
import re
import string
import sys
from collections import defaultdict

import nltk
from nltk.stem.porter import PorterStemmer

if __name__ == '__main__':
    # Cluster a bunch of text documents.
    k = 6
    vocab = {}
    xs = []
    ns = []
    cat = []
    filename = '2013-01.csv'

    # Read the news text (column 3) and category (column 4) from the CSV file.
    with open(filename, newline='') as f:
        try:
            newsreader = csv.reader(f)
            for row in newsreader:
                ns.append(row[3])
                cat.append(row[4])
        except csv.Error as e:
            sys.exit('file %s, line %d: %s' % (filename, newsreader.line_num, e))

    remove_spl_char_regex = re.compile('[%s]' % re.escape(string.punctuation))  # regex to remove special characters
    remove_num = re.compile(r'\d+')  # regex to remove numbers
    # nltk.download()
    stop_words = set(nltk.corpus.stopwords.words('english'))
    stemmer = PorterStemmer()

    for a in ns:
        x = defaultdict(float)

        a1 = a.strip().lower()
        a2 = remove_spl_char_regex.sub(" ", a1)  # remove special characters
        a3 = remove_num.sub("", a2)              # remove numbers

        # Remove stop words and stem what is left.
        words = a3.split()
        filter_stop_words = [w for w in words if w not in stop_words]
        stemed = [stemmer.stem(w) for w in filter_stop_words]
        ws = sorted(stemed)

        # ws = re.findall(r"\w+", a1)
        # Build the vocabulary and the per-document term-frequency vector.
        for w in ws:
            vocab.setdefault(w, len(vocab))
            x[vocab[w]] += 1
        xs.append(x.items())



Can anyone explain to me how I can do this pre-processing step with Spark,
before running k-means?
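What I have in mind is something along the following lines: do the same
cleaning per document on the workers, build TF-IDF vectors with MLlib, and
then feed them to MLlib's k-means. This is only a rough, untested sketch under
my own assumptions (PySpark with MLlib's HashingTF/IDF and KMeans, a simple
one-record-per-line CSV, the same tokenize logic as above, and the NLTK
stopwords corpus available on every worker), so please correct me if this is
not the right approach:

import csv
import re
import string
from io import StringIO

import nltk
from nltk.stem.porter import PorterStemmer

from pyspark import SparkContext
from pyspark.mllib.feature import HashingTF, IDF
from pyspark.mllib.clustering import KMeans

remove_spl_char_regex = re.compile('[%s]' % re.escape(string.punctuation))
remove_num = re.compile(r'\d+')
stop_words = set(nltk.corpus.stopwords.words('english'))
stemmer = PorterStemmer()

def tokenize(text):
    # Same cleaning as the local script: lowercase, strip punctuation and
    # digits, drop English stop words, then stem.
    text = remove_spl_char_regex.sub(" ", text.strip().lower())
    text = remove_num.sub("", text)
    words = [w for w in text.split() if w not in stop_words]
    return [stemmer.stem(w) for w in words]

if __name__ == '__main__':
    sc = SparkContext(appName="TextClusteringKMeans")

    # Assumes one CSV record per line; column 3 holds the news text.
    lines = sc.textFile('2013-01.csv')
    docs = lines.map(lambda line: next(csv.reader(StringIO(line)))[3])

    # Tokenize on the workers and build TF-IDF vectors with MLlib.
    tokens = docs.map(tokenize)
    tf = HashingTF().transform(tokens)
    tf.cache()
    tfidf = IDF().fit(tf).transform(tf)

    # Cluster the TF-IDF vectors with k-means (k = 6, as in my local code).
    model = KMeans.train(tfidf, k=6, maxIterations=10)

    sc.stop()

I am especially unsure whether shipping the NLTK stop-word and stemming logic
to the workers like this is the right way to do it.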
 
Best Regards

.......................................................

Amin Mohebbi

PhD candidate in Software Engineering 
 at University of Malaysia  

Tel : +60 18 2040 017



E-Mail : tp025...@ex.apiit.edu.my

              amin_...@me.com
