Hi Philippe, I used K-Means on TF-IDF vectors and wondered the same thing about how to label the documents. I haven't got my code to hand at the moment and it was a few months ago that I last looked at it (so I was probably also on an older version of Mahout)... but I seem to remember doing just what you suggest: I attached a unique ID to each document, which then got passed through the map-reduce stages. This requires a bit of tinkering with the K-Means implementation but shouldn't be too much work.
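Roughly something like the sketch below. Caveat: this uses NamedVector from newer Mahout releases, which may not exist in the version you're running, and the document ID and term weights are made up for illustration:

    import org.apache.mahout.math.NamedVector;
    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;

    public class LabelledVectorSketch {
        public static void main(String[] args) {
            // Build the TF-IDF vector for one document (values are made up).
            Vector tfidf = new RandomAccessSparseVector(45000);
            tfidf.set(102, 0.37);

            // Wrap it with a unique document ID; the name is serialized
            // along with the vector, so it survives the map-reduce stages
            // and comes back out attached to the clustered point.
            NamedVector doc = new NamedVector(tfidf, "enron-maildir-00123");

            System.out.println(doc.getName());
        }
    }
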
As for having massive vectors, you could try representing them as sparse vectors rather than the dense vectors the standard Mahout K-Means example works with; a sparse representation stores only the non-zero entries, so all the zero values in your document vectors cost you nothing. See the Javadoc for details, it'll be more reliable than my memory :-) There's a small sketch of this at the end of this message.

Richard

2008/12/3 Philippe Lamarche <[EMAIL PROTECTED]>

> Hi,
>
> I have a question concerning text clustering and the current
> K-Means/vectors implementation.
>
> For a school project, I did some text clustering with a subset of the
> Enron corpus. I implemented a small M/R package that transforms text into
> TF-IDF vector space, and then I used a slightly modified version of the
> syntheticcontrol K-Means example. So far, all is fine.
>
> However, the output of the K-Means algorithm is a vector, as is the
> input. As I understand it, when text is transformed into vector space,
> the cardinality of the vector is the number of words in your global
> dictionary, i.e. all words in all the texts being clustered. This can
> grow pretty quickly. For example, with only 27000 Enron emails, even
> after removing words that appear in 2 emails or fewer, the dictionary
> size is about 45000 words.
>
> My number one problem is this: how can we find out which document a
> vector represents when it comes out of the K-Means algorithm? My favorite
> solution would be to have a unique ID attached to each vector. Is there
> such an ID in the vector implementation? Is there a better solution? Is
> my approach to text clustering wrong?
>
> Thanks for the help,
>
> Philippe.
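P.S. Here is the sparse-vector sketch I mentioned. Same caveat as before: the class names are from newer Mahout releases (older trunk had a SparseVector class in a different package, so check the Javadoc for your version), and the term indices and TF-IDF weights are invented:

    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;

    public class SparseVsDenseSketch {
        public static void main(String[] args) {
            int dictionarySize = 45000; // from your Enron subset

            // Dense: allocates one double per dictionary term, ~45000
            // slots, almost all of them zero for any single email.
            Vector dense = new DenseVector(dictionarySize);

            // Sparse: same cardinality, but only the terms you actually
            // set are stored, so an email with 200 distinct terms costs
            // roughly 200 entries instead of 45000.
            Vector sparse = new RandomAccessSparseVector(dictionarySize);
            sparse.set(102, 0.37);   // made-up term index / TF-IDF weight
            sparse.set(8731, 1.92);

            System.out.println("stored entries: "
                    + sparse.getNumNondefaultElements());
        }
    }
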
