Hi, I have a question concerning text clustering and the current K-Means/vector implementation.
For a school project, I did some text clustering on a subset of the Enron corpus. I implemented a small M/R package that transforms the text into TF-IDF vector space, and then I used a slightly modified version of the syntheticcontrol K-Means example. So far, all is fine.

However, the output of the k-means algorithm is vectors, just like the input. As I understand it, when text is transformed into vector space, the cardinality of each vector is the size of the global dictionary, i.e., all the distinct words across all the texts being clustered (a toy sketch of this transformation follows my signature). This can grow pretty quickly: with only 27000 Enron emails, even after removing words that appear in 2 emails or fewer, the dictionary is about 45000 words.

My number one problem is this: how can we find out which document a vector represents once it comes out of the k-means algorithm? My favorite solution would be to attach a unique ID to each vector (see the second sketch below). Is there such an ID in the vector implementation? Is there a better solution? Is my approach to text clustering wrong?

Thanks for the help,
Philippe
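P.S. To make the two points above concrete, here are two rough sketches in plain Java (no Hadoop, no Mahout; all class and variable names are mine, and the toy corpus is made up). The first shows the TF-IDF transformation I described, and why the vector cardinality is the dictionary size:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class TfIdfSketch {
    public static void main(String[] args) {
        // Toy stand-in for the email corpus; my real code does this in M/R.
        List<String[]> docs = new ArrayList<String[]>();
        docs.add("meeting tomorrow about the merger".split(" "));
        docs.add("the merger is confirmed".split(" "));
        docs.add("lunch meeting tomorrow".split(" "));

        // Global dictionary: one dimension per distinct word in the whole
        // corpus. This is the cardinality that blows up (45000 words for
        // my 27000 emails, even after pruning rare words).
        Map<String, Integer> dictionary = new HashMap<String, Integer>();
        Map<String, Integer> docFreq = new HashMap<String, Integer>();
        for (String[] doc : docs) {
            Set<String> distinct = new HashSet<String>(Arrays.asList(doc));
            for (String word : distinct) {
                dictionary.putIfAbsent(word, dictionary.size());
                docFreq.merge(word, 1, Integer::sum);
            }
        }

        int n = docs.size();
        // One sparse TF-IDF vector per document: dictionary index -> weight.
        for (String[] doc : docs) {
            Map<String, Integer> tf = new HashMap<String, Integer>();
            for (String word : doc) {
                tf.merge(word, 1, Integer::sum);
            }
            Map<Integer, Double> vector = new HashMap<Integer, Double>();
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                double idf = Math.log((double) n / docFreq.get(e.getKey()));
                vector.put(dictionary.get(e.getKey()), e.getValue() * idf);
            }
            System.out.println(vector);
        }
        System.out.println("dictionary size = " + dictionary.size());
    }
}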
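And the second shows the kind of ID-carrying vector I have in mind. The NamedVector name and its fields are made up for illustration; I am not claiming anything like this exists in the current Vector implementation, which is exactly my question:

import java.util.HashMap;
import java.util.Map;

public class NamedVectorSketch {

    // A sparse vector that remembers which document it was built from,
    // so a cluster assignment coming out of k-means can be tied back
    // to the original email instead of an anonymous vector.
    static class NamedVector {
        final String documentId;            // e.g. a message-id or file path
        final Map<Integer, Double> weights; // dictionary index -> TF-IDF weight

        NamedVector(String documentId, Map<Integer, Double> weights) {
            this.documentId = documentId;
            this.weights = weights;
        }
    }

    public static void main(String[] args) {
        Map<Integer, Double> weights = new HashMap<Integer, Double>();
        weights.put(7, 0.42);   // hypothetical weight for dictionary word #7
        weights.put(123, 1.30); // hypothetical weight for dictionary word #123
        NamedVector v = new NamedVector("enron-email-00042", weights); // made-up ID
        System.out.println(v.documentId + " -> " + v.weights);
    }
}

The catch, and really my question, is whether such an ID would survive the serialization into and out of the k-means M/R job, or whether the clustering code would drop it along the way.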
