I will try to do the same.

On Fri, Dec 5, 2008 at 8:40 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> On Dec 5, 2008, at 6:05 AM, Richard Tomsett wrote:
>
>> Sure :-) I haven't got my project on me at the moment but should be able
>> to get at it some time before Xmas, so I will look through it again and
>> send you anything that may be useful.
>
> Cool, just add a patch to JIRA if you can. I think we could work together
> to create a Text Clustering "example".
>
>> 2008/12/5 Grant Ingersoll <[EMAIL PROTECTED]>
>>
>>> I seem to recall some discussion a while back about being able to add
>>> labels to the vectors/matrices, but I don't know the status of the patch.
>>>
>>> At any rate, very cool that you are using it for text clustering. I
>>> still have on my list to write up how to do this and to write some
>>> supporting code as well. So, if either of you cares to contribute, that
>>> would be most useful.
>>>
>>> -Grant
>>>
>>> On Dec 3, 2008, at 6:46 PM, Richard Tomsett wrote:
>>>
>>>> Hi Philippe,
>>>>
>>>> I used K-Means on TF-IDF vectors and wondered the same thing about
>>>> labelling the documents. I haven't got my code on me at the moment, and
>>>> it was a few months ago that I last looked at it (so I was probably
>>>> using an older version of Mahout)... but I seem to remember that I did
>>>> just as you are suggesting and simply attached a unique ID to each
>>>> document, which got passed through the map-reduce stages. This requires
>>>> a bit of tinkering with the K-Means implementation but shouldn't be too
>>>> much work.
>>>>
>>>> As for having massive vectors, you could try representing them as
>>>> sparse vectors rather than the dense vectors the standard Mahout
>>>> K-Means algorithm accepts, which gets rid of all the zero values in
>>>> the document vectors.
>>>> See the Javadoc for details; it'll be more reliable than my memory :-)
>>>>
>>>> Richard
>>>>
>>>> 2008/12/3 Philippe Lamarche <[EMAIL PROTECTED]>
>>>>
>>>>> Hi,
>>>>>
>>>>> I have a question concerning text clustering and the current
>>>>> K-Means/vectors implementation.
>>>>>
>>>>> For a school project, I did some text clustering with a subset of the
>>>>> Enron corpus. I implemented a small M/R package that transforms text
>>>>> into TF-IDF vector space, and then I used a slightly modified version
>>>>> of the syntheticcontrol K-Means example. So far, all is fine.
>>>>>
>>>>> However, the output of the K-Means algorithm is vectors, as is the
>>>>> input. As I understand it, when text is transformed into vector space,
>>>>> the cardinality of each vector is the number of words in your global
>>>>> dictionary, i.e. all words in all the texts being clustered. This can
>>>>> grow pretty quickly. For example, with only 27,000 Enron emails, even
>>>>> after removing words that appear in 2 emails or fewer, the dictionary
>>>>> size is about 45,000 words.
>>>>>
>>>>> My number one problem is this: how can we find out what document a
>>>>> vector is representing when it comes out of the K-Means algorithm? My
>>>>> favorite solution would be to have a unique ID attached to each
>>>>> vector. Is there such an ID in the vector implementation? Is there a
>>>>> better solution? Is my approach to text clustering wrong?
>>>>>
>>>>> Thanks for the help,
>>>>>
>>>>> Philippe.
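Richard's suggestion above — attach a unique ID to each document and pass it through the map-reduce stages — can be sketched roughly as follows. This is a minimal, self-contained illustration in plain Java (no Hadoop or Mahout dependencies); the `NamedVector` wrapper and `assign` step are hypothetical names for illustration, not confirmed Mahout API of that era:

```java
import java.util.*;

// Sketch: carry a document ID alongside each vector through the K-Means
// assignment step, so cluster output can be mapped back to documents.
public class LabeledKMeansSketch {

    // A vector tagged with the document it came from (illustrative wrapper).
    static class NamedVector {
        final String docId;
        final double[] values;
        NamedVector(String docId, double[] values) {
            this.docId = docId;
            this.values = values;
        }
    }

    static double distSq(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            s += d * d;
        }
        return s;
    }

    // Analogue of the "map" step of one K-Means iteration: assign each
    // labeled vector to its nearest centroid, collecting (cluster -> docIds).
    static Map<Integer, List<String>> assign(List<NamedVector> docs,
                                             double[][] centroids) {
        Map<Integer, List<String>> clusters = new HashMap<>();
        for (NamedVector nv : docs) {
            int best = 0;
            for (int c = 1; c < centroids.length; c++) {
                if (distSq(nv.values, centroids[c])
                        < distSq(nv.values, centroids[best])) {
                    best = c;
                }
            }
            clusters.computeIfAbsent(best, k -> new ArrayList<>())
                    .add(nv.docId);
        }
        return clusters;
    }

    public static void main(String[] args) {
        List<NamedVector> docs = Arrays.asList(
            new NamedVector("enron-001", new double[]{0.9, 0.1}),
            new NamedVector("enron-002", new double[]{0.1, 0.8}));
        double[][] centroids = {{1.0, 0.0}, {0.0, 1.0}};
        System.out.println(assign(docs, centroids));
    }
}
```

In a real Hadoop job the ID would ride along as (part of) the key or a wrapper `Writable`, which is the "bit of tinkering" Richard mentions.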
>>> --------------------------
>>> Grant Ingersoll
>>>
>>> Lucene Helpful Hints:
>>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
>>> http://wiki.apache.org/lucene-java/LuceneFAQ
