Re: Text clustering

Grant Ingersoll Thu, 04 Dec 2008 18:27:52 -0800

I seem to recall some discussion a while back about being able to add labels to the vectors/matrices, but I don't know the status of the patch.

At any rate, very cool that you are using it for text clustering. I still have on my list to write up how to do this and to write some supporting code as well. So, if either of you cares to contribute, that would be most useful.


-Grant

On Dec 3, 2008, at 6:46 PM, Richard Tomsett wrote:

Hi Philippe,
I used the K-Means on TF-IDF vectors and wondered the same thing about labelling the documents. I haven't got my code on me at the moment and it was a few months ago that I last looked at it (so I was also probably using an older version of Mahout)... but I seem to remember that I did just as you are suggesting and simply attached a unique ID to each document which got passed through the map-reduce stages. This requires a bit of tinkering with the K-Means implementation but shouldn't be too much work.
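A minimal sketch of the "carry an ID along with each vector" idea, in plain Python rather than Mahout's Java API. Every name here (`assign_with_ids`, `group_by_cluster`, the `email-00x` IDs) is hypothetical and only illustrates the shape of the data; it is not the actual Mahout K-Means code.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length dense vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_with_ids(documents, centroids):
    """Map-like step: for each (doc_id, vector) pair, emit
    (index of nearest centroid, doc_id) so the ID survives the assignment."""
    for doc_id, vector in documents:
        nearest = min(
            range(len(centroids)),
            key=lambda i: squared_distance(vector, centroids[i]),
        )
        yield nearest, doc_id


def group_by_cluster(assignments):
    """Reduce-like step: collect the document IDs assigned to each cluster."""
    clusters = {}
    for cluster_index, doc_id in assignments:
        clusters.setdefault(cluster_index, []).append(doc_id)
    return clusters


# Toy data: three documents in a 3-word vector space, two centroids.
documents = [
    ("email-001", [1.0, 0.0, 0.0]),
    ("email-002", [0.9, 0.1, 0.0]),
    ("email-003", [0.0, 0.0, 1.0]),
]
centroids = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

clusters = group_by_cluster(assign_with_ids(documents, centroids))
```

Because the ID travels as part of the key/value pair, the clustering output can be read back as lists of document IDs instead of anonymous vectors.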
As for having massive vectors, you could try representing them as sparse vectors rather than the dense vectors the standard Mahout K-Means algorithm accepts, which gets rid of all the zero values in the document vectors. See the Javadoc for details, it'll be more reliable than my memory :-)
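To illustrate the sparse-versus-dense point, here is a rough Python sketch that stores only the non-zero entries of a document vector in a dict keyed by dictionary index. This is an assumption-laden illustration of the general technique, not Mahout's sparse vector class; `to_sparse` and `sparse_squared_distance` are made-up helper names.

```python
def to_sparse(dense):
    """Keep only non-zero entries, as {dictionary_index: value}."""
    return {i: v for i, v in enumerate(dense) if v != 0.0}


def sparse_squared_distance(a, b):
    """Squared Euclidean distance over the union of non-zero indices;
    indices missing from a vector are treated as 0.0."""
    keys = a.keys() | b.keys()
    return sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys)


# A document vector over a 45000-word dictionary with only two non-zero weights.
dense = [0.0] * 45000
dense[12] = 0.5
dense[40321] = 1.25

sparse = to_sparse(dense)
distance = sparse_squared_distance(sparse, {12: 0.5})
```

The dense form holds 45000 floats per document, while the sparse form here holds two entries, and the distance computation only touches indices that are non-zero in at least one of the two vectors.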

Richard


2008/12/3 Philippe Lamarche <[EMAIL PROTECTED]>
Hi,

I have a question concerning text clustering and the current K-Means/vectors implementation.
For a school project, I did some text clustering with a subset of the Enron corpus. I implemented a small M/R package that transforms text into TF-IDF vector space, and then I used a slightly modified version of the syntheticcontrol K-Means example. So far, all is fine.
However, the output of the k-means algorithm is a vector, as is the input. As I understand it, when text is transformed into vector space, the cardinality of the vector is the number of words in your global dictionary, that is, all the words in all the texts being clustered. This can grow pretty quickly. For example, with only 27000 Enron emails, even when removing words that appear in 2 emails or fewer, the dictionary size is about 45000 words.
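A small Python sketch of why the vector cardinality equals the dictionary size, and how dropping rare words shrinks it. It uses one common TF-IDF variant (raw term count times log(N / document frequency)), which may differ from the M/R package described here; the function names, the toy documents, and the `min_doc_freq` parameter are all assumptions for illustration.

```python
import math
from collections import Counter


def build_dictionary(tokenized_docs, min_doc_freq=3):
    """Count in how many documents each word appears, then keep only words
    seen in at least min_doc_freq documents (3 drops words in 2 or fewer)."""
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))
    kept = sorted(w for w, df in doc_freq.items() if df >= min_doc_freq)
    index = {w: i for i, w in enumerate(kept)}
    return index, doc_freq


def tfidf_vector(tokens, index, doc_freq, num_docs):
    """Dense TF-IDF vector whose length is the dictionary size."""
    vector = [0.0] * len(index)
    for word, tf in Counter(tokens).items():
        if word in index:
            vector[index[word]] = tf * math.log(num_docs / doc_freq[word])
    return vector


docs = [
    "meeting schedule energy".split(),
    "energy trading meeting".split(),
    "energy prices meeting".split(),
    "lunch schedule".split(),
]
index, doc_freq = build_dictionary(docs, min_doc_freq=3)
vectors = [tfidf_vector(d, index, doc_freq, len(docs)) for d in docs]
```

Every vector has exactly `len(index)` components, so the pruning threshold directly controls the dimensionality that K-Means has to work with.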
My number one problem is this: how can we find out what document a vector is representing when it comes out of the k-means algorithm? My favorite solution would be to have a unique ID attached to each vector. Is there such an ID in the vector implementation? Is there a better solution? Is my approach to text clustering wrong?

Thanks for the help,

Philippe.


--------------------------
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
