Guys,

As for the pre-processing question, you could just migrate that logic to
Spark and run it there before calling K-means.

I have only used Scala on Spark and haven't used the Python bindings, but I
think the basic steps should be the same.
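
A rough sketch of what that migration could look like in Python, not a
tested pipeline. Assumptions that are not in the original code: the input is
a text file with one document per line (the original reads columns from a
CSV), a tiny hard-coded stop-word set stands in for the NLTK list, stemming
is left out, and the names tokenize and cluster_documents are made up for
illustration. The MLlib pieces used are HashingTF, IDF and KMeans.train.

```python
import re
import string

# Assumption: a tiny stand-in stop-word list; the original code uses NLTK's
# English stop words instead.
STOP_WORDS = frozenset({"a", "an", "and", "in", "is", "of", "on", "the", "to"})

_PUNCT = re.compile("[%s]" % re.escape(string.punctuation))
_DIGITS = re.compile(r"\d+")


def tokenize(text):
    """Lowercase, replace punctuation with spaces, drop digits and stop words."""
    cleaned = _DIGITS.sub("", _PUNCT.sub(" ", text.strip().lower()))
    return [w for w in cleaned.split() if w not in STOP_WORDS]


def cluster_documents(sc, path, k=6, num_features=1 << 18):
    """Tokenize on the cluster, build TF-IDF vectors, then run K-means.

    Hypothetical helper: `sc` is an existing SparkContext and `path` points
    to a text file with one document per line. pyspark is imported here so
    that tokenize() can be tried locally without Spark installed.
    """
    from pyspark.mllib.clustering import KMeans
    from pyspark.mllib.feature import HashingTF, IDF

    docs = sc.textFile(path).map(tokenize)      # RDD of token lists
    tf = HashingTF(numFeatures=num_features).transform(docs)
    tf.cache()                                  # reused by fit() and transform()
    tfidf = IDF().fit(tf).transform(tf)         # RDD of sparse vectors
    return KMeans.train(tfidf, k, maxIterations=20)
```

The per-document cleaning lives in a plain function, so it can be checked
locally first, e.g. tokenize("The 3 cats, in 2014!"), before it is handed to
map() on the cluster.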

BTW, if your data set is big and the feature vectors are huge and sparse,
K-Means may not work as well as you expect. I think this is still an area
where Spark MLlib is being optimized.

On Wed, Nov 19, 2014 at 2:21 PM, amin mohebbi <aminn_...@yahoo.com.invalid>
wrote:

> Hi there,
>
> I would like to do "text clustering" using k-means and Spark on a massive
> dataset. As you know, before running k-means I have to apply
> pre-processing steps such as TF-IDF and NLTK-based cleaning to my big
> dataset. The following is my code in Python:
>
> import csv
> import re
> import string
> import sys
> from collections import defaultdict
>
> import nltk
> from nltk.stem.porter import PorterStemmer
>
> if __name__ == '__main__':
>     # Cluster a bunch of text documents.
>     k = 6
>     vocab = {}
>     xs = []
>     ns = []
>     cat = []
>     filename = '2013-01.csv'
>     with open(filename, newline='') as f:
>         try:
>             newsreader = csv.reader(f)
>             for row in newsreader:
>                 ns.append(row[3])
>                 cat.append(row[4])
>         except csv.Error as e:
>             sys.exit('file %s, line %d: %s'
>                      % (filename, newsreader.line_num, e))
>
>     # regex to remove special characters
>     remove_spl_char_regex = re.compile('[%s]' % re.escape(string.punctuation))
>     remove_num = re.compile(r'[\d]+')
>     # nltk.download()
>     stop_words = nltk.corpus.stopwords.words('english')
>     for a in ns:
>         x = defaultdict(float)
>         a1 = a.strip().lower()
>         a2 = remove_spl_char_regex.sub(" ", a1)  # Remove special characters
>         a3 = remove_num.sub("", a2)  # Remove numbers
>         # Remove stop words
>         words = a3.split()
>         filter_stop_words = [w for w in words if w not in stop_words]
>         stemed = [PorterStemmer().stem(w) for w in filter_stop_words]
>         ws = sorted(stemed)
>         # ws = re.findall(r"\w+", a1)
>         for w in ws:
>             vocab.setdefault(w, len(vocab))
>             x[vocab[w]] += 1
>         xs.append(x.items())
>
> Can anyone explain how I can do the pre-processing step before running
> k-means using Spark?
>
>
> Best Regards
>
> .......................................................
>
> Amin Mohebbi
>
> PhD candidate in Software Engineering
>  at university of Malaysia
>
> Tel : +60 18 2040 017
>
>
>
> E-Mail : tp025...@ex.apiit.edu.my
>
>               amin_...@me.com
>



-- 
yangjun...@gmail.com
http://hi.baidu.com/yjpro
