Fast iterators over different views of an index are definitely good for recommendation.
An in-memory version is also nice. A new JIRA for the part that you don't have is probably a good thing. Submit the patch you have so that we can make progress.

On Thu, Apr 11, 2013 at 5:20 AM, Gokhan Capan <gkhn...@gmail.com> wrote:

> Ok,
>
> Honestly I didn't understand the cross-recommendation, and I guess for
> possible persistent Lucene Matrix implementation, the desired feature is a
> fast iterator, which computes next by querying the index. Am I correct?
>
> Should I submit the diff for in memory version to MAHOUT-1178, or create a
> separate issue?
>
>
> On Wed, Apr 10, 2013 at 5:00 PM, Ted Dunning <ted.dunn...@gmail.com>
> wrote:
>
> > This is awesome. Not exactly what I asked for, but in some ways better
> > than what I asked for (I love it when that happens). I think a sequential
> > implementation like this is a fine place to start.
> >
> > The array of matrices should work very well for moderate to small cross
> > recommendation. If we have a sharded index, then we can build an
> > InputFormat on the top of this pretty easily.
> >
> > Can you put up a JIRA and a patch for this?
> >
> >
> > On Tue, Apr 9, 2013 at 10:43 AM, Gokhan Capan <gkhn...@gmail.com> wrote:
> >
> > > I have an implementation of "casting" a Lucene index to a SparseRowMatrix,
> > > with following properties:
> > >
> > > - Row vectors are named and labeled with unique identifier id
> > > - Column vectors are labeled with terms
> > > - Dimensionality is numDocs * vocabularySize
> > > - It works on StringField, too.
> > > - It has a static creator for multiple fields, returns an array of matrix.
> > > - It doesn't support numerical fields, yet.
> > >
> > > The code is tested, and I use it for instantiating matrices from Lucene
> > > indexes. I can submit a patch if it is desired.
> > >
> > > This is in memory, and loads the entire index to the matrix. Lately I've
> > > decided to implement a persistent version of it, which is planned to load
> > > from index whenever a get request is made, and writes to actual index with
> > > a set request. And I plan to use the docID field, which was attached as the
> > > row label in previous implementation as the actual row index. Rest will be
> > > the same.
> > >
> > >
> > > On Fri, Mar 29, 2013 at 3:53 AM, Ted Dunning <ted.dunn...@gmail.com>
> > > wrote:
> > >
> > > > It should be possible to view a Lucene index as a matrix. This would
> > > > require that we standardize on a way to convert documents to rows. There
> > > > are many choices, the discussion of which should be deferred to the actual
> > > > work on the project, but there are a few obvious constraints:
> > > >
> > > > a) it should be possible to get the same result as dumping the term vectors
> > > > for each document each to a line and converting that result using standard
> > > > Mahout methods.
> > > >
> > > > b) numeric fields ought to work somehow.
> > > >
> > > > c) if there are multiple text fields that ought to work sensibly as well.
> > > > Two options include dumping multiple matrices or to convert the fields
> > > > into a single row of a single matrix.
> > > >
> > > > d) it should be possible to refer back from a row of the matrix to find the
> > > > correct document. This might be because we remember the Lucene doc number
> > > > or because a field is named as holding a unique id.
> > > >
> > > > e) named vectors and matrices should be used if plausible.
> > > >
> > > > On Thu, Mar 28, 2013 at 4:58 PM, Dan Filimon <dangeorge.fili...@gmail.com>
> > > > wrote:
> > > >
> > > > > ...
> > > > > Ted, could you explain a bit more what you mean by "simplify the connection
> > > > > to Lucene for clustering and classification"? It's too vague for an idea
> > > > > proposal.
> > >
> > >
> > > --
> > > Gokhan
>
>
> --
> Gokhan
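
For readers following the thread, here is a minimal sketch of the in-memory "cast" described above: rows labeled with a unique document id, columns labeled with terms, dimensionality numDocs x vocabularySize, and one matrix per field. It is plain Python and does not use the Lucene or Mahout APIs; the patch itself is not shown in this thread. The `docs` list of dicts, the whitespace tokenizer, and the names `index_to_sparse_rows` and `index_to_matrices` are all hypothetical stand-ins chosen for illustration.

```python
from collections import Counter


def index_to_sparse_rows(docs, field):
    """Cast one field of a toy 'index' to a sparse row matrix.

    docs  : list of dicts, each with a unique "id" key plus text fields
            (a hypothetical stand-in for documents in a Lucene index).
    field : name of the text field to convert.

    Returns (row_labels, column_labels, rows), where rows[i] is a dict
    mapping column index -> term count for document i.
    """
    # Whitespace splitting stands in for a real analyzer (assumption).
    tokenized = [doc.get(field, "").lower().split() for doc in docs]

    # Column labels are the sorted vocabulary of the field.
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    column_of = {term: j for j, term in enumerate(vocabulary)}

    # Row labels are the unique ids, so a row can be traced back to its document.
    row_labels = [doc["id"] for doc in docs]

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append({column_of[term]: n for term, n in counts.items()})
    return row_labels, vocabulary, rows


def index_to_matrices(docs, fields):
    """Multi-field variant: one sparse matrix per field, returned as a list."""
    return [index_to_sparse_rows(docs, field) for field in fields]
```

Because everything is materialized up front, this mirrors only the in-memory version; the persistent version discussed in the thread would instead look rows up in the index on each get request.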