On Dec 5, 2008, at 6:05 AM, Richard Tomsett wrote:
Sure :-) I haven't got my project on me at the moment but should be able to get at it some time before Xmas, so I will look through it again and send you anything that may be useful.
Cool, just add a patch to JIRA, if you can. I think we could work together to create a Text Clustering "example".
2008/12/5 Grant Ingersoll <[EMAIL PROTECTED]>
I seem to recall some discussion a while back about being able to add labels to the vectors/matrices, but I don't know the status of the patch.

At any rate, very cool that you are using it for text clustering. I still have on my list to write up how to do this and to write some supporting code as well. So, if either of you cares to contribute, that would be most useful.
-Grant
On Dec 3, 2008, at 6:46 PM, Richard Tomsett wrote:
Hi Philippe,
I used K-Means on TF-IDF vectors and wondered the same thing about labelling the documents. I haven't got my code on me at the moment, and it was a few months ago that I last looked at it (so I was also probably using an older version of Mahout)... but I seem to remember that I did just as you are suggesting and simply attached a unique ID to each document, which got passed through the map-reduce stages. This requires a bit of tinkering with the K-Means implementation but shouldn't be too much work.
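Something along these lines, roughly (a sketch against the old Hadoop API; the class name and the findNearestCluster helper are made up for illustration, and the real Mahout K-Means mapper is organised differently):

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Input key is the document ID, input value the serialized TF-IDF
    // vector; the ID simply rides along in the value so it survives
    // every map-reduce stage untouched.
    public class LabelledAssignmentMapper extends MapReduceBase
        implements Mapper<Text, Text, Text, Text> {

      @Override
      public void map(Text docId, Text vectorText,
                      OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        // Stand-in for the distance computation the K-Means mapper
        // already performs against the current centroids.
        String clusterId = findNearestCluster(vectorText.toString());
        // Emit (clusterId, docId + vector): the final output then tells
        // you which document landed in which cluster.
        output.collect(new Text(clusterId),
                       new Text(docId.toString() + "\t" + vectorText.toString()));
      }

      private String findNearestCluster(String vector) {
        // ... distance computation against current centroids elided ...
        return "cluster-0";
      }
    }

The point is just that the document ID never takes part in the distance computation; it only travels alongside the vector.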
As for having massive vectors, you could try representing them as sparse vectors rather than the dense vectors the standard Mahout K-Means algorithm accepts, which gets rid of all the zero values in the document vectors. See the Javadoc for details; it'll be more reliable than my memory :-)
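For instance, something like this (I'm going from memory on the class and package names, so double-check them against your Mahout version):

    import org.apache.mahout.matrix.SparseVector;
    import org.apache.mahout.matrix.Vector;

    // Cardinality is the full dictionary size (~45,000 in your case),
    // but only the terms that actually occur in the document are stored.
    Vector doc = new SparseVector(45000);
    doc.set(17, 0.42);     // TF-IDF weight of dictionary term 17
    doc.set(23881, 1.07);  // TF-IDF weight of dictionary term 23881
    // Every other entry remains an implicit zero.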
Richard
2008/12/3 Philippe Lamarche <[EMAIL PROTECTED]>
Hi,
I have a question concerning text clustering and the current K-Means/vectors implementation.

For a school project, I did some text clustering with a subset of the Enron corpus. I implemented a small M/R package that transforms text into TF-IDF vector space, and then I used a slightly modified version of the syntheticcontrol K-Means example. So far, all is fine.
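For reference, the weighting I compute is the usual one, roughly (helper name just for illustration):

    // tfidf(t, d) = tf(t, d) * log(N / df(t)),
    // where N is the total number of documents and df(t) is the
    // number of documents containing term t.
    static double tfIdf(int termFreq, int docFreq, int numDocs) {
        return termFreq * Math.log((double) numDocs / docFreq);
    }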
However, the output of the K-Means algorithm is a vector, as is the input. As I understand it, when text is transformed into vector space, the cardinality of the vector is the number of words in your global dictionary, i.e., all words in all texts being clustered. This can grow pretty quickly. For example, with only 27,000 Enron emails, even when removing words that appear in 2 emails or fewer, the dictionary size is about 45,000 words.
My number one problem is this: how can we find out which document a vector represents when it comes out of the K-Means algorithm? My favorite solution would be to have a unique ID attached to each vector. Is there such an ID in the vector implementation? Is there a better solution? Is my approach to text clustering wrong?
Thanks for the help,
Philippe.
--------------------------
Grant Ingersoll
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
--------------------------