I will try to do the same.

On Fri, Dec 5, 2008 at 8:40 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> On Dec 5, 2008, at 6:05 AM, Richard Tomsett wrote:
>
>> Sure :-) I haven't got my project on me at the moment but should be able
>> to get at it some time before Xmas, so I will look through it again and
>> send you anything that may be useful.
>
> Cool, just add a patch to JIRA if you can. I think we could work together
> to create a Text Clustering "example".
>
>> 2008/12/5 Grant Ingersoll <[EMAIL PROTECTED]>
>>
>>> I seem to recall some discussion a while back about being able to add
>>> labels to the vectors/matrices, but I don't know the status of the patch.
>>>
>>> At any rate, very cool that you are using it for text clustering. I
>>> still have on my list to write up how to do this and to write some
>>> supporting code as well. So, if either of you cares to contribute, that
>>> would be most useful.
>>>
>>> -Grant
>>>
>>> On Dec 3, 2008, at 6:46 PM, Richard Tomsett wrote:
>>>
>>>> Hi Philippe,
>>>>
>>>> I used K-Means on TF-IDF vectors and wondered the same thing about
>>>> labelling the documents. I haven't got my code on me at the moment, and
>>>> it was a few months ago that I last looked at it (so I was probably
>>>> using an older version of Mahout)... but I seem to remember that I did
>>>> just as you are suggesting and simply attached a unique ID to each
>>>> document, which got passed through the map-reduce stages. This requires
>>>> a bit of tinkering with the K-Means implementation but shouldn't be too
>>>> much work.
>>>>
>>>> As for having massive vectors, you could try representing them as
>>>> sparse vectors rather than the dense vectors the standard Mahout
>>>> K-Means algorithm accepts, which gets rid of all the zero values in
>>>> the document vectors.
>>>> See the Javadoc for details; it'll be more reliable than my memory :-)
>>>>
>>>> Richard
>>>>
>>>> 2008/12/3 Philippe Lamarche <[EMAIL PROTECTED]>
>>>>
>>>>> Hi,
>>>>>
>>>>> I have a question concerning text clustering and the current
>>>>> K-Means/vectors implementation.
>>>>>
>>>>> For a school project, I did some text clustering with a subset of the
>>>>> Enron corpus. I implemented a small M/R package that transforms text
>>>>> into TF-IDF vector space, and then I used a slightly modified version
>>>>> of the syntheticcontrol K-Means example. So far, all is fine.
>>>>>
>>>>> However, the output of the K-Means algorithm is vectors, as is the
>>>>> input. As I understand it, when text is transformed into vector space,
>>>>> the cardinality of each vector is the number of words in your global
>>>>> dictionary, i.e. all words in all the texts being clustered. This can
>>>>> grow pretty quickly. For example, with only 27,000 Enron emails, even
>>>>> after removing words that appear in 2 emails or fewer, the dictionary
>>>>> size is about 45,000 words.
>>>>>
>>>>> My number one problem is this: how can we find out what document a
>>>>> vector is representing when it comes out of the K-Means algorithm? My
>>>>> favorite solution would be to have a unique ID attached to each
>>>>> vector. Is there such an ID in the vector implementation? Is there a
>>>>> better solution? Is my approach to text clustering wrong?
>>>>>
>>>>> Thanks for the help,
>>>>>
>>>>> Philippe.
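Richard's suggestion above — attach a unique ID to each document and pass it through the map-reduce stages — can be sketched roughly as follows. This is a minimal, self-contained illustration in plain Java (no Hadoop or Mahout dependencies); the `NamedVector` wrapper and `assign` step are hypothetical names for illustration, not confirmed Mahout API of that era:

```java
import java.util.*;

// Sketch: carry a document ID alongside each vector through the K-Means
// assignment step, so cluster output can be mapped back to documents.
public class LabeledKMeansSketch {

    // A vector tagged with the document it came from (illustrative wrapper).
    static class NamedVector {
        final String docId;
        final double[] values;
        NamedVector(String docId, double[] values) {
            this.docId = docId;
            this.values = values;
        }
    }

    static double distSq(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            s += d * d;
        }
        return s;
    }

    // Analogue of the "map" step of one K-Means iteration: assign each
    // labeled vector to its nearest centroid, collecting (cluster -> docIds).
    static Map<Integer, List<String>> assign(List<NamedVector> docs,
                                             double[][] centroids) {
        Map<Integer, List<String>> clusters = new HashMap<>();
        for (NamedVector nv : docs) {
            int best = 0;
            for (int c = 1; c < centroids.length; c++) {
                if (distSq(nv.values, centroids[c])
                        < distSq(nv.values, centroids[best])) {
                    best = c;
                }
            }
            clusters.computeIfAbsent(best, k -> new ArrayList<>())
                    .add(nv.docId);
        }
        return clusters;
    }

    public static void main(String[] args) {
        List<NamedVector> docs = Arrays.asList(
            new NamedVector("enron-001", new double[]{0.9, 0.1}),
            new NamedVector("enron-002", new double[]{0.1, 0.8}));
        double[][] centroids = {{1.0, 0.0}, {0.0, 1.0}};
        System.out.println(assign(docs, centroids));
    }
}
```

In a real Hadoop job the ID would ride along as (part of) the key or a wrapper `Writable`, which is the "bit of tinkering" Richard mentions.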
>>> --------------------------
>>> Grant Ingersoll
>>>
>>> Lucene Helpful Hints:
>>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
>>> http://wiki.apache.org/lucene-java/LuceneFAQ
