Hi Grant,

I have code that creates a Lucene index from document text and then generates document vectors from it. It is stand-alone code, not a MapReduce (M/R) job. Is that something that interests you?
--shashi

On Thu, May 28, 2009 at 5:57 PM, Grant Ingersoll <[email protected]> wrote:
> I'm about to write some code to prepare docs for clustering and I know at
> least a few others on the list here have done the same. I was wondering if
> anyone is in the position to share their code and contribute to Mahout.
>
> As I see it, we need to be able to take in text and create the matrix of
> terms, where each cell is the TF/IDF (or some other weight, would be nice to
> be pluggable) and then normalize the vector (and, according to Ted, we
> should support using different norms). Seems like we also need the label
> stuff in place (https://issues.apache.org/jira/browse/MAHOUT-65) but I'm not
> sure on the state of that patch.
>
> As for the TF/IDF stuff, we sort of have it via the BayesTfIdfDriver, but it
> needs to be more generic. I realize we could use Lucene, but having a
> solution that scales w/ Lucene is going to take work, AIUI, whereas a M/R
> job seems more straightforward.
>
> I'd like to be able to get this stuff committed relatively soon and have the
> examples for other people. My shorter term goal is I'm working on some
> demos using Wikipedia.
>
> Thanks,
> Grant
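
For reference, the weighting and normalization steps described in the quoted message can be sketched roughly as below. This is a minimal in-memory illustration, not the stand-alone code mentioned above and not anything in Mahout or Lucene. The class name TfIdfSketch and its methods are hypothetical, the documents are assumed to be tokenized already, and raw count times log(N/df) is only one of several common TF/IDF variants. The norm is left as a parameter p so that different norms can be plugged in.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of Mahout or Lucene.
class TfIdfSketch {

    /** Builds one sparse TF-IDF vector (term -> weight) per tokenized document. */
    static List<Map<String, Double>> tfIdf(List<List<String>> docs) {
        // Document frequency: number of documents that contain each term.
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }

        int n = docs.size();
        List<Map<String, Double>> vectors = new ArrayList<>();
        for (List<String> doc : docs) {
            // Term frequency: raw count of each term in this document.
            Map<String, Double> tf = new HashMap<>();
            for (String term : doc) {
                tf.merge(term, 1.0, Double::sum);
            }
            // Assumed weighting: count * log(N / df). Other weights could go here.
            Map<String, Double> vec = new HashMap<>();
            for (Map.Entry<String, Double> e : tf.entrySet()) {
                double idf = Math.log((double) n / df.get(e.getKey()));
                vec.put(e.getKey(), e.getValue() * idf);
            }
            vectors.add(vec);
        }
        return vectors;
    }

    /** Scales a vector to unit length under the p-norm (p >= 1). Zero vectors are returned unchanged. */
    static Map<String, Double> normalize(Map<String, Double> vec, double p) {
        double sum = 0.0;
        for (double w : vec.values()) {
            sum += Math.pow(Math.abs(w), p);
        }
        double norm = Math.pow(sum, 1.0 / p);
        Map<String, Double> out = new HashMap<>();
        for (Map.Entry<String, Double> e : vec.entrySet()) {
            out.put(e.getKey(), norm == 0.0 ? e.getValue() : e.getValue() / norm);
        }
        return out;
    }

    public static void main(String[] args) {
        // Toy corpus, made up for this example.
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("mahout", "clusters", "documents"),
            Arrays.asList("lucene", "indexes", "documents"),
            Arrays.asList("mahout", "uses", "hadoop"));
        for (Map<String, Double> vec : tfIdf(docs)) {
            System.out.println(normalize(vec, 2.0)); // L2-normalized TF-IDF vector
        }
    }
}
```

Passing p = 1.0 instead of 2.0 gives L1 normalization. A version that scales would compute the document frequencies in a separate pass (for example one M/R job) before weighting, which this sketch does not attempt.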
