I did an initial cut at MAHOUT-65 but was blocked by the
serialization/deserialization needed to fully address the requirements.
Now that Gson is in lib it makes sense to use it. The Dirichlet package
already has a JsonVectorAdapter which could be rewritten. It is a big
change to the on-disk format, but most jobs have an initial step that
consumes e.g. CSV files, so changing it should not break much
compatibility. I will take another crack at it.
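For illustration, here is a minimal sketch (plain Python, standard library only) of the kind of JSON round-trip a vector adapter has to provide; the field names and functions are made up for this sketch and do not reflect JsonVectorAdapter's actual schema:

```python
import json

def vector_to_json(v):
    """Serialize a sparse vector (cardinality plus index->value map)
    to a JSON string.  Schema is illustrative only."""
    return json.dumps({
        "cardinality": v["cardinality"],
        # JSON object keys must be strings, so stringify the indices
        "values": {str(i): x for i, x in v["values"].items()},
    })

def vector_from_json(s):
    """Inverse of vector_to_json: rebuild the sparse vector,
    converting the string keys back to integer indices."""
    d = json.loads(s)
    return {
        "cardinality": d["cardinality"],
        "values": {int(i): x for i, x in d["values"].items()},
    }

v = {"cardinality": 10, "values": {0: 1.5, 7: -2.0}}
assert vector_from_json(vector_to_json(v)) == v
```

A real adapter would presumably also need to record the concrete Vector class so deserialization can restore the right implementation.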
Jeff
Grant Ingersoll wrote:
I'm about to write some code to prepare docs for clustering and I know
at least a few others on the list here have done the same. I was
wondering if anyone is in a position to share their code and
contribute to Mahout.
As I see it, we need to be able to take in text and create the matrix
of terms, where each cell is the TF/IDF weight (or some other weight;
it would be nice for this to be pluggable) and then normalize the
vector (and, according to Ted, we should support using different
norms). Seems like we also need the label stuff in place
(https://issues.apache.org/jira/browse/MAHOUT-65) but I'm not sure on
the state of that patch.
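To make the intended pipeline concrete, here is a small in-memory sketch of TF/IDF weighting with a pluggable Lp norm; it is plain Python illustrating the math, not the distributed Mahout implementation:

```python
import math
from collections import Counter

def tfidf_matrix(docs, norm=2.0):
    """Build a term-document matrix of TF/IDF weights, then normalize
    each document vector by a pluggable Lp norm (p = `norm`).
    A plain in-memory sketch; real corpora need an M/R job."""
    # document frequency: number of docs each term appears in
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    n = len(docs)
    vocab = sorted(df)
    rows = []
    for doc in docs:
        tf = Counter(doc)
        # raw tf * log(N / df) weighting
        row = [tf[t] * math.log(n / df[t]) for t in vocab]
        # pluggable Lp normalization (p=2 gives the Euclidean norm)
        length = sum(abs(x) ** norm for x in row) ** (1.0 / norm)
        if length > 0:
            row = [x / length for x in row]
        rows.append(row)
    return vocab, rows

docs = [["apache", "mahout", "mahout"], ["apache", "lucene"]]
vocab, m = tfidf_matrix(docs, norm=2.0)
```

Swapping `norm` (1.0 for Manhattan, 2.0 for Euclidean, etc.) is the "different norms" knob Ted mentioned.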
As for the TF/IDF stuff, we sort of have it via the BayesTfIdfDriver,
but it needs to be more generic. I realize we could use Lucene, but
having a solution that scales w/ Lucene is going to take work, AIUI,
whereas a M/R job seems more straightforward.
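As a sketch of why an M/R formulation is straightforward: the document-frequency step is just the classic word-count pattern. Simulated here in plain Python (the map/reduce function names are invented for illustration, not BayesTfIdfDriver's API):

```python
from collections import defaultdict

def map_df(doc_id, terms):
    """Map phase: emit (term, 1) once per document the term appears in."""
    for term in set(terms):
        yield term, 1

def reduce_sum(pairs):
    """Reduce phase: sum the counts per key, yielding each term's
    document frequency."""
    out = defaultdict(int)
    for k, v in pairs:
        out[k] += v
    return dict(out)

corpus = {"d1": ["apache", "mahout", "mahout"], "d2": ["apache", "lucene"]}
pairs = [kv for doc_id, terms in corpus.items()
         for kv in map_df(doc_id, terms)]
df = reduce_sum(pairs)
# df == {"apache": 2, "mahout": 1, "lucene": 1}
```

A second pass over the corpus can then combine these counts with per-document term frequencies to emit the weighted vectors.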
I'd like to be able to get this stuff committed relatively soon and
have the examples available for other people. My shorter-term goal is
some Wikipedia demos I'm working on.
Thanks,
Grant