Decomposer (in the process of being donated, just gotta choose which linear primitives to convert to!) has a DistributedMatrix which does this for data already parsed into SequenceFiles of Writable Vectors, and I really like this kind of interface.
Being able to do things like

  DistributedMatrix HdfsInputTextMatrix.extractTfIdfCorpus()

where this method sets up and runs an M/R job on a remote cluster, with the output also living on HDFS, and the handle you get back can do all the things which a Matrix impl can do... this kind of thing makes using the code much less like scripting some procedural Jobs, and more like actual OO programming.

  -jake

On Fri, Nov 13, 2009 at 1:15 PM, Ted Dunning <[email protected]> wrote:

> This talk combined with previous talk about preferred mode of composing
> tools (script writing using java) is beginning to make me think that we need
> something like a HdfsMatrix and LocalFileMatrix which are simply wrappers
> around file names, but which allow extraction of elements (for debugging and
> diagnostics and sequential implementations) or for passing to generic driver
> routines or receiving from generic conversion routines.
>
> Should I open a JIRA?
>
> On Fri, Nov 13, 2009 at 11:54 AM, Grant Ingersoll <[email protected]> wrote:
>
> > Also, take a look at what the TfIdfDriver does for the classifier stuff.
> > This is an M/R job for converting text for its format. I think we can
> > abstract that to be more general purpose and then move it under the Utils
> > module. The only thing that likely needs to change is whether we output the
> > Writable for the classifier or whether we output a Vector. That is my naive
> > view at this point.
>
> --
> Ted Dunning, CTO
> DeepDyve
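
For anyone skimming the thread, here is a rough sketch of the "wrapper around a file name" idea, local-file flavour only. Everything in it is hypothetical: FileBackedMatrix, this LocalFileMatrix, and the trace() driver are illustration names, not existing Mahout or Decomposer classes, and the one-row-per-line text format is just an assumption to keep the example self-contained (the real thing would read SequenceFiles of Writable Vectors, and an HDFS variant would go through the Hadoop FileSystem API instead of java.nio).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical interface: a handle to a matrix whose data lives in a file.
interface FileBackedMatrix {
    // Where the data lives; a driver routine can pass this on to a job.
    Path location();

    // Element extraction, meant for debugging, diagnostics and sequential code.
    double get(int row, int column);
}

// Hypothetical local variant. Assumed format: one row per line,
// values separated by whitespace.
final class LocalFileMatrix implements FileBackedMatrix {
    private final Path path;

    LocalFileMatrix(Path path) {
        this.path = path;
    }

    @Override
    public Path location() {
        return path;
    }

    @Override
    public double get(int row, int column) {
        try {
            // Re-reads the file on every call: fine for diagnostics,
            // far too slow for real computation.
            List<String> lines = Files.readAllLines(path);
            String[] cells = lines.get(row).trim().split("\\s+");
            return Double.parseDouble(cells[column]);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

final class FileMatrixSketch {
    // Stand-in for a "generic driver routine": it only needs the handle,
    // not any knowledge of where or how the matrix is stored.
    static double trace(FileBackedMatrix m, int size) {
        double sum = 0.0;
        for (int i = 0; i < size; i++) {
            sum += m.get(i, i);
        }
        return sum;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("matrix", ".txt");
        Files.write(tmp, List.of("1.0 2.0", "3.0 4.0"));

        FileBackedMatrix m = new LocalFileMatrix(tmp);
        System.out.println("element (1,0): " + m.get(1, 0));
        System.out.println("trace: " + trace(m, 2));

        Files.delete(tmp);
    }
}
```

The point of the sketch is only that callers program against the handle; whether a given method reads a local file or kicks off an M/R job and returns a new handle is an implementation detail behind the interface.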
