Hi Philippe,

I'm also doing some work on text clustering with feature extraction. For text clustering, cosine distance is generally considered a better similarity metric than Euclidean distance. I couldn't find a CosineDistanceMeasure in Mahout. Did you use a cosine distance measure in your clustering project?
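If it isn't there yet, the measure itself is simple enough to implement by hand. Here is a rough sketch of the computation, over plain double arrays rather than Mahout's Vector interface, since I'm not sure of the exact API in the current version:

// Sketch of a cosine distance measure: distance = 1 - cos(a, b),
// so identical directions give 0 and orthogonal documents give 1.
// Plain double[] is used here; adapting this to Mahout's Vector
// interface should just mean swapping the element accessors.
public final class CosineDistance {
  public static double distance(double[] a, double[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("Vector cardinalities differ");
    }
    double dot = 0.0, normA = 0.0, normB = 0.0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    if (normA == 0.0 || normB == 0.0) {
      return 1.0; // treat the zero vector as maximally distant, by convention
    }
    return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }
}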
regards,
Dipesh

On Fri, Dec 5, 2008 at 11:45 PM, Philippe Lamarche <[EMAIL PROTECTED]> wrote:

> I will try to do the same.
>
> On Fri, Dec 5, 2008 at 8:40 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
>
>> On Dec 5, 2008, at 6:05 AM, Richard Tomsett wrote:
>>
>>> Sure :-) I haven't got my project on me at the moment, but I should be
>>> able to get at it some time before Xmas, so I will look through it again
>>> and send you anything that may be useful.
>>
>> Cool, just add a patch to JIRA if you can. I think we could work together
>> to create a Text Clustering "example".
>>
>>> 2008/12/5 Grant Ingersoll <[EMAIL PROTECTED]>
>>>
>>>> I seem to recall some discussion a while back about being able to add
>>>> labels to the vectors/matrices, but I don't know the status of the patch.
>>>>
>>>> At any rate, it is very cool that you are using it for text clustering.
>>>> I still have it on my list to write up how to do this and to write some
>>>> supporting code as well. So, if either of you cares to contribute, that
>>>> would be most useful.
>>>>
>>>> -Grant
>>>>
>>>> On Dec 3, 2008, at 6:46 PM, Richard Tomsett wrote:
>>>>
>>>>> Hi Philippe,
>>>>>
>>>>> I used K-Means on TF-IDF vectors and wondered the same thing about
>>>>> labelling the documents. I haven't got my code on me at the moment, and
>>>>> it was a few months ago that I last looked at it (so I was also
>>>>> probably using an older version of Mahout), but I seem to remember that
>>>>> I did just as you are suggesting and simply attached a unique ID to
>>>>> each document, which got passed through the map-reduce stages. This
>>>>> requires a bit of tinkering with the K-Means implementation but
>>>>> shouldn't be too much work.
>>>>>
>>>>> As for having massive vectors, you could try representing them as
>>>>> sparse vectors rather than the dense vectors the standard Mahout
>>>>> K-Means algorithm accepts, which gets rid of all the zero values in the
>>>>> document vectors. See the Javadoc for details; it'll be more reliable
>>>>> than my memory :-)
>>>>>
>>>>> Richard
>>>>>
>>>>> 2008/12/3 Philippe Lamarche <[EMAIL PROTECTED]>
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I have a question concerning text clustering and the current
>>>>>> K-Means/vectors implementation.
>>>>>>
>>>>>> For a school project, I did some text clustering with a subset of the
>>>>>> Enron corpus. I implemented a small M/R package that transforms text
>>>>>> into TF-IDF vector space, and then I used a slightly modified version
>>>>>> of the syntheticcontrol K-Means example. So far, all is fine.
>>>>>>
>>>>>> However, the output of the k-means algorithm is a vector, as is the
>>>>>> input. As I understand it, when text is transformed into vector space,
>>>>>> the cardinality of the vector is the number of words in your global
>>>>>> dictionary, i.e., all words in all texts being clustered. This can
>>>>>> grow pretty quickly. For example, with only 27,000 Enron emails, even
>>>>>> after removing words that appear in 2 emails or fewer, the dictionary
>>>>>> size is about 45,000 words.
>>>>>>
>>>>>> My number one problem is this: how can we find out what document a
>>>>>> vector is representing when it comes out of the k-means algorithm?
>>>>>> My favorite solution would be to have a unique ID attached to each
>>>>>> vector. Is there such an ID in the vector implementation? Is there a
>>>>>> better solution? Is my approach to text clustering wrong?
>>>>>>
>>>>>> Thanks for the help,
>>>>>>
>>>>>> Philippe.
>>>>
>>>> --------------------------
>>>> Grant Ingersoll
>>>>
>>>> Lucene Helpful Hints:
>>>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
>>>> http://wiki.apache.org/lucene-java/LuceneFAQ
>>
>> --------------------------
>> Grant Ingersoll
>>
>> Lucene Helpful Hints:
>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
>> http://wiki.apache.org/lucene-java/LuceneFAQ

--
----------------------------------------
"Help Ever Hurt Never" - Baba
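P.S. For the document-labelling question above, something like the following wrapper is one way to carry an ID through the map-reduce stages. This is only a sketch of the idea Richard describes; the class and field names are made up for illustration and are not Mahout API:

// Sketch: pair a document ID with its term-weight vector so a cluster
// assignment coming out of k-means can be traced back to the source
// document. In the M/R job, emit docId as (part of) the key of each
// intermediate record so it survives into the final cluster output.
public final class LabeledDocument {
  private final String docId;    // e.g. the Enron message file name
  private final double[] tfidf;  // the document's TF-IDF vector

  public LabeledDocument(String docId, double[] tfidf) {
    this.docId = docId;
    this.tfidf = tfidf;
  }

  public String getDocId() { return docId; }
  public double[] getTfidf() { return tfidf; }
}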
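Similarly, a sparse representation only needs to store the non-zero entries, which matters with a 45,000-word dictionary where any single email uses only a small fraction of the terms. A minimal sketch of the idea follows; Mahout's own SparseVector, if your version has it, would be the real thing to use:

// Sketch of a sparse document vector: only non-zero TF-IDF weights are
// stored, keyed by the term's index in the global dictionary.
import java.util.HashMap;
import java.util.Map;

public final class SparseDocVector {
  private final int cardinality; // dictionary size, e.g. ~45,000
  private final Map<Integer, Double> weights = new HashMap<Integer, Double>();

  public SparseDocVector(int cardinality) { this.cardinality = cardinality; }

  public void set(int termIndex, double weight) {
    if (weight != 0.0) {
      weights.put(termIndex, weight); // zeros are simply never stored
    }
  }

  public double get(int termIndex) {
    Double w = weights.get(termIndex);
    return w == null ? 0.0 : w;
  }

  public int cardinality() { return cardinality; }
}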
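Finally, since the thread mentions transforming text into TF-IDF vector space, here is the weighting itself in one common variant, tf * log(N / df). The method assumes term frequencies per document and document frequencies over the corpus have already been counted (so every term seen in a document has df >= 1), and all names are illustrative:

// Sketch of one common TF-IDF variant; others scale or smooth the
// tf and idf factors differently.
import java.util.HashMap;
import java.util.Map;

public final class TfIdf {
  public static Map<String, Double> weigh(Map<String, Integer> termFreqs,
                                          Map<String, Integer> docFreqs,
                                          int numDocs) {
    Map<String, Double> weights = new HashMap<String, Double>();
    for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
      int df = docFreqs.get(e.getKey());           // assumed present, >= 1
      double idf = Math.log((double) numDocs / df); // natural log variant
      weights.put(e.getKey(), e.getValue() * idf);
    }
    return weights;
  }
}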
