Hi, I used the Tanimoto distance. As I understand it, it's similar to the cosine distance, but with a range of 0 to infinity rather than 0 to pi (about 3.14, the angle between the vectors). It seems to work well.
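In case it helps, here's a quick standalone sketch of the math as I understand it (my own toy code, not the Mahout implementation, so take it with a grain of salt):

```java
public class TanimotoExample {

    /** Tanimoto (extended Jaccard) similarity for real-valued vectors:
     *  T(a, b) = a.b / (|a|^2 + |b|^2 - a.b). For non-negative TF-IDF
     *  vectors this lies in [0, 1], with 1 for identical vectors. */
    static double tanimotoSimilarity(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot   += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (normA + normB - dot);
    }

    /** One distance form with range [0, infinity): -log2(T).
     *  Identical vectors give 0; disjoint vectors give infinity. */
    static double tanimotoDistance(double[] a, double[] b) {
        return -(Math.log(tanimotoSimilarity(a, b)) / Math.log(2.0));
    }
}
```

With T defined this way, identical vectors give distance 0 and completely disjoint vectors give infinity, which is the 0-to-infinity range I meant above.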
On Fri, Dec 5, 2008 at 11:54 PM, dipesh <[EMAIL PROTECTED]> wrote:

> Hi Philippe,
>
> I'm also doing some work on text clustering with feature extraction. For
> text clustering, the cosine distance is considered a better similarity
> metric than the Euclidean distance measure. I couldn't find
> CosineDistanceMeasure in Mahout; did you use a cosine distance measure in
> your clustering project?
>
> regards,
> Dipesh
>
> On Fri, Dec 5, 2008 at 11:45 PM, Philippe Lamarche <[EMAIL PROTECTED]> wrote:
>
>> I will try to do the same.
>>
>> On Fri, Dec 5, 2008 at 8:40 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
>>
>>> On Dec 5, 2008, at 6:05 AM, Richard Tomsett wrote:
>>>
>>>> Sure :-) I haven't got my project on me at the moment, but I should be
>>>> able to get at it some time before Xmas, so I will look through it
>>>> again and send you anything that may be useful.
>>>
>>> Cool, just add a patch to JIRA, if you can. I think we could work
>>> together to create a Text Clustering "example".
>>>
>>>> 2008/12/5 Grant Ingersoll <[EMAIL PROTECTED]>
>>>>
>>>>> I seem to recall some discussion a while back about being able to add
>>>>> labels to the vectors/matrices, but I don't know the status of the
>>>>> patch.
>>>>>
>>>>> At any rate, very cool that you are using it for text clustering. I
>>>>> still have on my list to write up how to do this and to write some
>>>>> supporting code as well. So, if either of you cares to contribute,
>>>>> that would be most useful.
>>>>>
>>>>> -Grant
>>>>>
>>>>> On Dec 3, 2008, at 6:46 PM, Richard Tomsett wrote:
>>>>>
>>>>>> Hi Philippe,
>>>>>>
>>>>>> I used K-Means on TF-IDF vectors and wondered the same thing, about
>>>>>> labelling the documents.
>>>>>> I haven't got my code on me at the moment, and it was a few months
>>>>>> ago that I last looked at it (so I was also probably using an older
>>>>>> version of Mahout)... but I seem to remember that I did just as you
>>>>>> are suggesting and simply attached a unique ID to each document,
>>>>>> which got passed through the map-reduce stages. This requires a bit
>>>>>> of tinkering with the K-Means implementation but shouldn't be too
>>>>>> much work.
>>>>>>
>>>>>> As for having massive vectors, you could try representing them as
>>>>>> sparse vectors rather than the dense vectors the standard Mahout
>>>>>> K-Means algorithm accepts, which gets rid of all the zero values in
>>>>>> the document vectors. See the Javadoc for details; it'll be more
>>>>>> reliable than my memory :-)
>>>>>>
>>>>>> Richard
>>>>>>
>>>>>> 2008/12/3 Philippe Lamarche <[EMAIL PROTECTED]>
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I have a question concerning text clustering and the current
>>>>>>> K-Means/vectors implementation.
>>>>>>>
>>>>>>> For a school project, I did some text clustering with a subset of
>>>>>>> the Enron corpus. I implemented a small M/R package that transforms
>>>>>>> text into TF-IDF vector space, and then I used a slightly modified
>>>>>>> version of the syntheticcontrol K-Means example. So far, all is
>>>>>>> fine.
>>>>>>>
>>>>>>> However, the output of the k-means algorithm is a vector, as is the
>>>>>>> input. As I understand it, when text is transformed into vector
>>>>>>> space, the cardinality of the vector is the number of words in your
>>>>>>> global dictionary, i.e. all words in all the texts being clustered.
>>>>>>> This can grow pretty quickly.
>>>>>>> For example, with only 27000 Enron emails, even when removing words
>>>>>>> that appear in 2 emails or fewer, the dictionary size is about 45000
>>>>>>> words.
>>>>>>>
>>>>>>> My number one problem is this: how can we find out what document a
>>>>>>> vector represents when it comes out of the k-means algorithm? My
>>>>>>> favorite solution would be to have a unique ID attached to each
>>>>>>> vector. Is there such an ID in the vector implementation? Is there a
>>>>>>> better solution? Is my approach to text clustering wrong?
>>>>>>>
>>>>>>> Thanks for the help,
>>>>>>>
>>>>>>> Philippe.
>>>>>
>>>>> --------------------------
>>>>> Grant Ingersoll
>>>>>
>>>>> Lucene Helpful Hints:
>>>>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
>>>>> http://wiki.apache.org/lucene-java/LuceneFAQ
>
> --
> ----------------------------------------
> "Help Ever Hurt Never" - Baba
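Regarding the missing CosineDistanceMeasure: rolling your own is straightforward. A rough standalone sketch (again my own toy code, not a Mahout class):

```java
public class CosineExample {

    /** Cosine distance: 1 - (a.b / (|a| * |b|)). For non-negative TF-IDF
     *  vectors the similarity lies in [0, 1], so the distance does too. */
    static double cosineDistance(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot   += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Note that cosine distance ignores vector length, only direction, which is usually what you want for documents of very different sizes.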
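On Richard's two suggestions (sparse vectors, plus a unique ID carried through the map-reduce stages), here is a hypothetical sketch of what such a structure could look like; the class name and fields are made up for illustration, and this is not Mahout's actual SparseVector API:

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: a sparse TF-IDF vector that stores only non-zero
 *  term weights and carries the document's unique ID, so the label can
 *  survive the map-reduce stages of k-means. Not Mahout's actual API. */
public class SparseDocVector {

    final String docId;     // unique document label, e.g. the email's path
    final int cardinality;  // global dictionary size (e.g. ~45000 terms)
    final Map<Integer, Double> weights = new HashMap<Integer, Double>();

    SparseDocVector(String docId, int cardinality) {
        this.docId = docId;
        this.cardinality = cardinality;
    }

    /** Store a TF-IDF weight; zeros are simply not stored. */
    void set(int termIndex, double tfidf) {
        if (tfidf != 0.0) {
            weights.put(termIndex, tfidf);
        }
    }

    /** Missing entries are implicit zeros. */
    double get(int termIndex) {
        Double w = weights.get(termIndex);
        return (w == null) ? 0.0 : w;
    }
}
```

Only the handful of non-zero terms per email is actually stored, so the 45000-term cardinality costs nothing per document, and the docId rides along and comes back out with the cluster assignment.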
