Clustering documents by term (a la LDA or SVD) also gives you a nice
clustering of terms, just by looking at "the transpose", right?  This is
literally the case for SVD: if M = U S V' is your SVD, where M is stored
with one row per document and U and V hold the singular vectors as columns
(document by reduced-dimension and term by reduced-dimension, respectively),
then typically you just keep V and S around.  In this case V has, as row
vectors, the projection of each term onto the reduced-dimensional space,
and clustering that set of reduced vectors performs "concept-aware" term
clustering.  And if you just want the system to run as a search engine
[find me the top terms "close" to a given term], you just sort by
descending dot-product against the rows of V.
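
To make the "search engine" idea concrete, here is a rough sketch in plain
Python (not Mahout code).  It takes V as term by reduced-dimension, so each
row is one term's vector.  The vocabulary, the numbers in V and S, and the
helper names term_vectors and nearest_terms are all made up for
illustration and are not the output of a real decomposition.  Scaling the
rows by S is my own assumption; it makes the dot products match the entries
of M'M = V S^2 V', but you could also compare the raw rows of V.

```python
# Toy stand-ins for a real decomposition; these numbers are made up.
vocab = ["cat", "dog", "stock", "bond"]
V = [                 # one row per term, one column per reduced dimension
    [0.70, 0.10],     # cat
    [0.69, 0.12],     # dog
    [0.10, 0.70],     # stock
    [0.12, 0.69],     # bond
]
S = [3.0, 2.0]        # singular values, largest first


def term_vectors(V, S):
    """Scale each row of V by the singular values (i.e. compute V * S)."""
    return [[v * s for v, s in zip(row, S)] for row in V]


def nearest_terms(term, vocab, V, S, top_n=3):
    """Rank the other terms by descending dot product with `term`."""
    vectors = term_vectors(V, S)
    query = vectors[vocab.index(term)]
    scores = [
        (other, sum(q * x for q, x in zip(query, vec)))
        for other, vec in zip(vocab, vectors)
        if other != term
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_n]


print(nearest_terms("cat", vocab, V, S))
```

With these toy values "dog" ranks closest to "cat".  The same scaled rows
could be handed to k-means to get actual term clusters.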

For our LDA implementation, I'm not sure, but the same idea should apply:
given the set of all topics, each topic has a probability of producing each
term, so the transpose of this topic-by-term matrix gives, for any given
term, the probability of it being produced by each of the topics.  I'm not
sure if our current implementation has methods you can easily use to get
access to this information and thereby cluster the terms, however.
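
As a sketch of what that transpose would look like, in plain Python and
independent of whatever our LDA code actually exposes: the topic_term
matrix below is hypothetical (row k holds made-up values of
P(term | topic k)), and so are the helper names.  Normalising each
transposed row so it sums to 1 amounts to assuming all topics are equally
likely a priori, which is my simplification rather than anything LDA gives
you for free.

```python
vocab = ["cat", "dog", "stock", "bond"]
topic_term = [                      # row k = P(term | topic k), made-up values
    [0.40, 0.40, 0.10, 0.10],
    [0.05, 0.15, 0.40, 0.40],
]


def term_topic_vectors(topic_term):
    """Transpose to term-by-topic, then normalise each row to sum to 1."""
    vectors = []
    for column in zip(*topic_term):
        total = sum(column) or 1.0  # guard against a term no topic produces
        vectors.append([p / total for p in column])
    return vectors


def cluster_by_top_topic(vocab, vectors):
    """Crude clustering: group each term under its most probable topic."""
    clusters = {}
    for term, vec in zip(vocab, vectors):
        best = max(range(len(vec)), key=vec.__getitem__)
        clusters.setdefault(best, []).append(term)
    return clusters


vectors = term_topic_vectors(topic_term)
print(cluster_by_top_topic(vocab, vectors))
```

Grouping by the single best topic is the crudest option; the per-term
vectors could just as well be fed to k-means like any other vectors.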

On Tue, Sep 29, 2009 at 1:05 PM, Grant Ingersoll <[email protected]> wrote:

> The LDA implementation kind of clusters on terms to generate topics.  It
> sounds like you want some co-occurrence analysis; I'm not sure that the
> clustering algorithms are best for that, but perhaps others have insight.
> I could imagine doing this with HBase or Pig and just keeping a matrix
> where each cell kept track of the number of times both terms appear in a
> document (or even within some window in a document).
>
>
>
> On Sep 29, 2009, at 8:57 AM, Ole-Martin Mørk wrote:
>
>> Hi.
>> I have been using org.apache.mahout.utils.vectors.lucene.Driver
>> and org.apache.mahout.clustering.kmeans.KMeansDriver to cluster documents
>> in our Lucene index and it works great! I am wondering though, is it
>> possible to use Mahout to cluster terms?
>>
>> I want to cluster terms that often appear in the same documents.
>>
>> Thank you.
>>
>> --
>> Ole-Martin Mørk
>> http://twitter.com/olemartin
>> http://flickr.com/olemartin
>>
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
> Solr/Lucene:
> http://www.lucidimagination.com/search
>
>
