No need to worry about backward compatibility at this point, even for on-
disk formats. Such is life in a pre-1.0 world. Sure, we should
strive to minimize breakage and get it right the first time, but
that's how it goes in open source.
On May 28, 2009, at 9:07 AM, Jeff Eastman wrote:
I did an initial cut at MAHOUT-65 but was blocked by the
serialization/deserialization needed to fully address the
requirements. Now that Gson is in lib, it makes sense to use it. The
Dirichlet package already has a JsonVectorAdapter which could be
rewritten. It is a big change to the on-disk format, but most jobs
have an initial step that consumes e.g. CSV files, so changing the
format should not break much. I will take another crack at it.
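
For concreteness, here's a rough sketch of the shape of adapter I
have in mind (the DenseVector below is just a stand-in for
illustration, not our real Vector class):

import java.lang.reflect.Type;
import com.google.gson.JsonDeserializationContext;
import com.google.gson.JsonDeserializer;
import com.google.gson.JsonElement;
import com.google.gson.JsonObject;
import com.google.gson.JsonSerializationContext;
import com.google.gson.JsonSerializer;

// Stand-in vector: a label plus dense values.
class DenseVector {
  String name;
  double[] values;
  DenseVector(String name, double[] values) {
    this.name = name;
    this.values = values;
  }
}

// Round-trips the vector through JSON in both directions.
class VectorAdapter implements JsonSerializer<DenseVector>,
    JsonDeserializer<DenseVector> {
  public JsonElement serialize(DenseVector src, Type type,
      JsonSerializationContext context) {
    JsonObject obj = new JsonObject();
    obj.addProperty("name", src.name);
    obj.add("values", context.serialize(src.values));
    return obj;
  }

  public DenseVector deserialize(JsonElement json, Type type,
      JsonDeserializationContext context) {
    JsonObject obj = json.getAsJsonObject();
    String name = obj.get("name").getAsString();
    double[] values = context.deserialize(obj.get("values"), double[].class);
    return new DenseVector(name, values);
  }
}

Registering it with new GsonBuilder().registerTypeAdapter(DenseVector.class,
new VectorAdapter()).create() would make the on-disk format plain JSON
text, which any downstream job can parse.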
Jeff
Grant Ingersoll wrote:
I'm about to write some code to prepare docs for clustering, and I
know at least a few others on the list here have done the same. I
was wondering if anyone is in a position to share their code and
contribute it to Mahout.
As I see it, we need to be able to take in text and create the
matrix of terms, where each cell is the TF/IDF weight (or some other
weight; it would be nice for this to be pluggable), and then
normalize the vector (and, according to Ted, we should support using
different norms). Seems like we also need the label stuff in place
(https://issues.apache.org/jira/browse/MAHOUT-65) but I'm not sure
of the state of that patch.
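
To make the pluggable part concrete, here's the rough shape I have
in mind (all of these names are made up for illustration; none of
this exists in Mahout today):

// Pluggable cell weight: given term stats, produce the matrix cell value.
interface Weight {
  double calculate(int tf, int df, int numDocs);
}

// Classic TF-IDF as one implementation: tf * log(N / df).
class TfIdfWeight implements Weight {
  public double calculate(int tf, int df, int numDocs) {
    return tf * Math.log((double) numDocs / df);
  }
}

class Norms {
  // Normalize in place by the L_p norm; p = 2 gives unit Euclidean length.
  static void normalize(double[] vector, double p) {
    double sum = 0.0;
    for (double v : vector) {
      sum += Math.pow(Math.abs(v), p);
    }
    double norm = Math.pow(sum, 1.0 / p);
    if (norm > 0.0) {
      for (int i = 0; i < vector.length; i++) {
        vector[i] /= norm;
      }
    }
  }
}

With something like that, TF/IDF is just one Weight implementation
and the norm is a parameter rather than hard-wired.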
As for the TF/IDF stuff, we sort of have it via the
BayesTfIdfDriver, but it needs to be more generic. I realize we
could use Lucene, but having a solution that scales w/ Lucene is
going to take work, AIUI, whereas an M/R job seems more
straightforward.
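
As a strawman, the document-frequency half of TF/IDF is basically a
word-count variant: the mapper emits each distinct term once per
document and the reducer sums. Sketched against the Hadoop mapreduce
API (tokenization here is naive whitespace splitting; a real job
would make the analyzer pluggable):

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// One input record per document; emit each distinct term once.
public class DocFreqMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text term = new Text();

  protected void map(LongWritable key, Text doc, Context context)
      throws IOException, InterruptedException {
    Set<String> seen = new HashSet<String>();
    for (String token : doc.toString().toLowerCase().split("\\s+")) {
      if (token.length() > 0 && seen.add(token)) {
        term.set(token);
        context.write(term, ONE); // once per document, not per occurrence
      }
    }
  }
}

// Sum the per-document flags to get df(term).
public class DocFreqReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  protected void reduce(Text term, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int df = 0;
    for (IntWritable c : counts) {
      df += c.get();
    }
    context.write(term, new IntWritable(df));
  }
}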
I'd like to get this stuff committed relatively soon and have
examples available for other people. In the shorter term, I'm
working on some demos using Wikipedia.
Thanks,
Grant