Hi, I used the Tanimoto distance. As I understand it, it's similar to the cosine distance, but with a range of 0 to infinity rather than 0 to pi (about 3.14, the angle between the vectors). It seems to work well.
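In case it helps, here's a quick standalone sketch of the math as I understand it (my own toy code, not the Mahout implementation, so take it with a grain of salt):

```java
public class TanimotoExample {

    /** Tanimoto (extended Jaccard) similarity for real-valued vectors:
     *  T(a, b) = a.b / (|a|^2 + |b|^2 - a.b). For non-negative TF-IDF
     *  vectors this lies in [0, 1], with 1 for identical vectors. */
    static double tanimotoSimilarity(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot   += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (normA + normB - dot);
    }

    /** One distance form with range [0, infinity): -log2(T).
     *  Identical vectors give 0; disjoint vectors give infinity. */
    static double tanimotoDistance(double[] a, double[] b) {
        return -(Math.log(tanimotoSimilarity(a, b)) / Math.log(2.0));
    }
}
```

With T defined this way, identical vectors give distance 0 and completely disjoint vectors give infinity, which is the 0-to-infinity range I meant above.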
On Fri, Dec 5, 2008 at 11:54 PM, dipesh <[EMAIL PROTECTED]> wrote:

> Hi Philippe,
>
> I'm also doing some work on text clustering with feature extraction. For
> text clustering, the cosine distance is considered a better similarity
> metric than the Euclidean distance measure. I couldn't find
> CosineDistanceMeasure in Mahout; did you use a cosine distance measure in
> your clustering project?
>
> regards,
> Dipesh
>
> On Fri, Dec 5, 2008 at 11:45 PM, Philippe Lamarche <[EMAIL PROTECTED]> wrote:
>
>> I will try to do the same.
>>
>> On Fri, Dec 5, 2008 at 8:40 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
>>
>>> On Dec 5, 2008, at 6:05 AM, Richard Tomsett wrote:
>>>
>>>> Sure :-) I haven't got my project on me at the moment, but I should be
>>>> able to get at it some time before Xmas, so I will look through it
>>>> again and send you anything that may be useful.
>>>
>>> Cool, just add a patch to JIRA, if you can. I think we could work
>>> together to create a Text Clustering "example".
>>>
>>>> 2008/12/5 Grant Ingersoll <[EMAIL PROTECTED]>
>>>>
>>>>> I seem to recall some discussion a while back about being able to add
>>>>> labels to the vectors/matrices, but I don't know the status of the
>>>>> patch.
>>>>>
>>>>> At any rate, very cool that you are using it for text clustering. I
>>>>> still have on my list to write up how to do this and to write some
>>>>> supporting code as well. So, if either of you cares to contribute,
>>>>> that would be most useful.
>>>>>
>>>>> -Grant
>>>>>
>>>>> On Dec 3, 2008, at 6:46 PM, Richard Tomsett wrote:
>>>>>
>>>>>> Hi Philippe,
>>>>>>
>>>>>> I used K-Means on TF-IDF vectors and wondered the same thing, about
>>>>>> labelling the documents.
>>>>>> I haven't got my code on me at the moment, and it was a few months
>>>>>> ago that I last looked at it (so I was also probably using an older
>>>>>> version of Mahout)... but I seem to remember that I did just as you
>>>>>> are suggesting and simply attached a unique ID to each document,
>>>>>> which got passed through the map-reduce stages. This requires a bit
>>>>>> of tinkering with the K-Means implementation but shouldn't be too
>>>>>> much work.
>>>>>>
>>>>>> As for having massive vectors, you could try representing them as
>>>>>> sparse vectors rather than the dense vectors the standard Mahout
>>>>>> K-Means algorithm accepts, which gets rid of all the zero values in
>>>>>> the document vectors. See the Javadoc for details; it'll be more
>>>>>> reliable than my memory :-)
>>>>>>
>>>>>> Richard
>>>>>>
>>>>>> 2008/12/3 Philippe Lamarche <[EMAIL PROTECTED]>
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I have a question concerning text clustering and the current
>>>>>>> K-Means/vectors implementation.
>>>>>>>
>>>>>>> For a school project, I did some text clustering with a subset of
>>>>>>> the Enron corpus. I implemented a small M/R package that transforms
>>>>>>> text into TF-IDF vector space, and then I used a slightly modified
>>>>>>> version of the syntheticcontrol K-Means example. So far, all is
>>>>>>> fine.
>>>>>>>
>>>>>>> However, the output of the k-means algorithm is a vector, as is the
>>>>>>> input. As I understand it, when text is transformed into vector
>>>>>>> space, the cardinality of the vector is the number of words in your
>>>>>>> global dictionary, i.e. all words in all the texts being clustered.
>>>>>>> This can grow pretty quickly.
>>>>>>> For example, with only 27000 Enron emails, even when removing words
>>>>>>> that appear in 2 emails or fewer, the dictionary size is about 45000
>>>>>>> words.
>>>>>>>
>>>>>>> My number one problem is this: how can we find out what document a
>>>>>>> vector represents when it comes out of the k-means algorithm? My
>>>>>>> favorite solution would be to have a unique ID attached to each
>>>>>>> vector. Is there such an ID in the vector implementation? Is there a
>>>>>>> better solution? Is my approach to text clustering wrong?
>>>>>>>
>>>>>>> Thanks for the help,
>>>>>>>
>>>>>>> Philippe.
>>>>>
>>>>> --------------------------
>>>>> Grant Ingersoll
>>>>>
>>>>> Lucene Helpful Hints:
>>>>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
>>>>> http://wiki.apache.org/lucene-java/LuceneFAQ
>
> --
> ----------------------------------------
> "Help Ever Hurt Never" - Baba
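Regarding the missing CosineDistanceMeasure: rolling your own is straightforward. A rough standalone sketch (again my own toy code, not a Mahout class):

```java
public class CosineExample {

    /** Cosine distance: 1 - (a.b / (|a| * |b|)). For non-negative TF-IDF
     *  vectors the similarity lies in [0, 1], so the distance does too. */
    static double cosineDistance(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot   += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Note that cosine distance ignores vector length, only direction, which is usually what you want for documents of very different sizes.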
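On Richard's two suggestions (sparse vectors, plus a unique ID carried through the map-reduce stages), here is a hypothetical sketch of what such a structure could look like; the class name and fields are made up for illustration, and this is not Mahout's actual SparseVector API:

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: a sparse TF-IDF vector that stores only non-zero
 *  term weights and carries the document's unique ID, so the label can
 *  survive the map-reduce stages of k-means. Not Mahout's actual API. */
public class SparseDocVector {

    final String docId;     // unique document label, e.g. the email's path
    final int cardinality;  // global dictionary size (e.g. ~45000 terms)
    final Map<Integer, Double> weights = new HashMap<Integer, Double>();

    SparseDocVector(String docId, int cardinality) {
        this.docId = docId;
        this.cardinality = cardinality;
    }

    /** Store a TF-IDF weight; zeros are simply not stored. */
    void set(int termIndex, double tfidf) {
        if (tfidf != 0.0) {
            weights.put(termIndex, tfidf);
        }
    }

    /** Missing entries are implicit zeros. */
    double get(int termIndex) {
        Double w = weights.get(termIndex);
        return (w == null) ? 0.0 : w;
    }
}
```

Only the handful of non-zero terms per email is actually stored, so the 45000-term cardinality costs nothing per document, and the docId rides along and comes back out with the cluster assignment.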
