The LDA implementation more or less clusters on terms to generate topics. It sounds like you want co-occurrence analysis, though. I'm not sure the clustering algorithms are the best fit for that, but perhaps others have insight. I could imagine doing this with HBase or Pig: just keep a matrix in which each cell tracks the number of times both terms appear together in a document (or even within some window in a document).
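
To make the matrix idea above concrete, here is a minimal in-memory sketch in Python. It is not Mahout, HBase, or Pig code; the function names, the `max_distance` parameter, and the assumption that documents arrive already tokenized are all mine, for illustration only. A sparse dictionary keyed by term pairs stands in for the matrix.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count, for each unordered term pair, the number of documents
    in which both terms appear.

    `documents` is assumed to be an iterable of token lists.
    """
    counts = Counter()
    for tokens in documents:
        # Deduplicate so a pair is counted at most once per document,
        # and sort so each pair has one canonical (a, b) key.
        terms = sorted(set(tokens))
        for a, b in combinations(terms, 2):
            counts[(a, b)] += 1
    return counts


def window_cooccurrence_counts(documents, max_distance=5):
    """Count occurrences of term pairs that appear at most
    `max_distance` token positions apart (hypothetical parameter).

    Unlike cooccurrence_counts, a pair can be counted several times
    within the same document.
    """
    counts = Counter()
    for tokens in documents:
        for i, a in enumerate(tokens):
            # Look only forward so each position pair is visited once.
            for b in tokens[i + 1 : i + 1 + max_distance]:
                if a != b:
                    counts[tuple(sorted((a, b)))] += 1
    return counts


if __name__ == "__main__":
    docs = [
        ["lucene", "solr", "search"],
        ["lucene", "mahout"],
        ["solr", "lucene"],
    ]
    for pair, n in cooccurrence_counts(docs).most_common():
        print(pair, n)
```

The pair counts grow roughly with the square of the vocabulary per document, which is why something like HBase or Pig would be the more realistic home for this on a full Lucene index. The resulting counts could then serve as a similarity measure between terms.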

On Sep 29, 2009, at 8:57 AM, Ole-Martin Mørk wrote:

Hi.
I have been using org.apache.mahout.utils.vectors.lucene.Driver and org.apache.mahout.clustering.kmeans.KMeansDriver to cluster documents in our Lucene index and it works great! I am wondering though, is it possible to use Mahout to cluster terms?

I want to cluster terms that often appear in the same documents.

Thank you.

--
Ole-Martin Mørk
http://twitter.com/olemartin
http://flickr.com/olemartin

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
http://www.lucidimagination.com/search
