Re: GSOC proposals and mentors [was Call to action – Mahout needs your help]

Ted Dunning Thu, 11 Apr 2013 07:13:44 -0700

Fast iterators over different views of an index are definitely good for
recommendation.


An in-memory version is also nice.

A new JIRA for the part that you don't have is probably a good thing.
 Submit the patch you have so that we can make progress.
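
The "fast iterator over a view of an index" idea can be sketched roughly as below. This is only an illustration under stated assumptions: `RowSource`, `LazyRowIterator`, and `CountingSource` are hypothetical names, not Lucene or Mahout classes, and the toy source reads from an array where a real implementation would presumably query the index. The point it shows is that each call to `next()` fetches exactly one row, so nothing is loaded up front.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.NoSuchElementException;

// Hypothetical stand-in for an index-backed store; not a Lucene or Mahout API.
interface RowSource {
  int numRows();

  // Returns the sparse row for one document as column index -> weight.
  Map<Integer, Double> fetchRow(int row);
}

// Iterates rows lazily: each call to next() issues one fetch against the source.
final class LazyRowIterator implements Iterator<Map<Integer, Double>> {
  private final RowSource source;
  private int nextRow = 0;

  LazyRowIterator(RowSource source) {
    this.source = source;
  }

  @Override
  public boolean hasNext() {
    return nextRow < source.numRows();
  }

  @Override
  public Map<Integer, Double> next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    return source.fetchRow(nextRow++);
  }
}

// Toy source that counts fetches, to make the on-demand loading visible.
final class CountingSource implements RowSource {
  int fetches = 0;
  private final double[][] dense;

  CountingSource(double[][] dense) {
    this.dense = dense;
  }

  @Override
  public int numRows() {
    return dense.length;
  }

  @Override
  public Map<Integer, Double> fetchRow(int row) {
    fetches++;
    Map<Integer, Double> sparse = new LinkedHashMap<>();
    for (int col = 0; col < dense[row].length; col++) {
      if (dense[row][col] != 0.0) {
        sparse.put(col, dense[row][col]); // keep only non-zero entries
      }
    }
    return sparse;
  }
}
```

Different "views" (by document, by field, and so on) would then be different `RowSource` implementations behind the same iterator, if this shape turns out to fit.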



On Thu, Apr 11, 2013 at 5:20 AM, Gokhan Capan <gkhn...@gmail.com> wrote:

> Ok,
>
> Honestly, I didn't understand the cross-recommendation part. I guess that for
> a possible persistent Lucene matrix implementation, the desired feature is a
> fast iterator that computes next by querying the index. Am I correct?
>
> Should I submit the diff for the in-memory version to MAHOUT-1178, or create a
> separate issue?
>
>
> On Wed, Apr 10, 2013 at 5:00 PM, Ted Dunning <ted.dunn...@gmail.com>
> wrote:
>
> > This is awesome.  Not exactly what I asked for, but in some ways better
> > than what I asked for (I love it when that happens).  I think a sequential
> > implementation like this is a fine place to start.
> >
> > The array of matrices should work very well for small to moderate cross
> > recommendation.  If we have a sharded index, then we can build an
> > InputFormat on top of this pretty easily.
> >
> > Can you put up a JIRA and a patch for this?
> >
> >
> > On Tue, Apr 9, 2013 at 10:43 AM, Gokhan Capan <gkhn...@gmail.com> wrote:
> >
> > > > I have an implementation of "casting" a Lucene index to a SparseRowMatrix,
> > > > with the following properties:
> > > >
> > > > - Row vectors are named and labeled with a unique identifier id
> > > > - Column vectors are labeled with terms
> > > > - Dimensionality is numDocs * vocabularySize
> > > > - It works on StringField, too.
> > > > - It has a static creator for multiple fields, which returns an array of
> > > >   matrices.
> > > > - It doesn't support numerical fields yet.
> > >
> > > The code is tested, and I use it for instantiating matrices from Lucene
> > > indexes. I can submit a patch if it is desired.
> > >
> > > > This is in memory, and loads the entire index into the matrix. Lately I've
> > > > decided to implement a persistent version of it, which is planned to load
> > > > from the index whenever a get request is made, and to write to the actual
> > > > index on a set request. I also plan to use the docID field, which was
> > > > attached as the row label in the previous implementation, as the actual row
> > > > index. The rest will be the same.
> > >
> > >
> > >
> > >
> > > On Fri, Mar 29, 2013 at 3:53 AM, Ted Dunning <ted.dunn...@gmail.com>
> > > wrote:
> > >
> > > > > It should be possible to view a Lucene index as a matrix.  This would
> > > > > require that we standardize on a way to convert documents to rows.  There
> > > > > are many choices, the discussion of which should be deferred to the actual
> > > > > work on the project, but there are a few obvious constraints:
> > > >
> > > > > a) it should be possible to get the same result as dumping the term vectors
> > > > > for each document, one per line, and converting that result using standard
> > > > > Mahout methods.
> > > >
> > > > b) numeric fields ought to work somehow.
> > > >
> > > > > c) if there are multiple text fields, that ought to work sensibly as well.
> > > > > Two options are dumping multiple matrices or converting the fields
> > > > > into a single row of a single matrix.
> > > >
> > > > > d) it should be possible to refer back from a row of the matrix to find the
> > > > > correct document.  This might be because we remember the Lucene doc number
> > > > > or because a field is named as holding a unique id.
> > > >
> > > > e) named vectors and matrices should be used if plausible.
> > > >
> > > > > On Thu, Mar 28, 2013 at 4:58 PM, Dan Filimon <dangeorge.fili...@gmail.com>
> > > > > wrote:
> > > >
> > > > > ...
> > > > > > Ted, could you explain a bit more what you mean by "simplify the connection
> > > > > > to Lucene for clustering and classification"? It's too vague for an idea
> > > > > > proposal.
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Gokhan
> > >
> >
>
>
>
> --
> Gokhan
>
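
The in-memory "index as matrix" idea from the thread (rows labeled with a unique document id, columns labeled with terms, dimensionality numDocs * vocabularySize) might look roughly like the sketch below. It is a plain-Java illustration under stated assumptions: `TermMatrixSketch` and its methods are hypothetical, it is not Mahout's SparseRowMatrix, and the per-document term frequencies are passed in as maps where the real code would read term vectors from a Lucene index.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical container that mirrors the properties listed in the thread.
final class TermMatrixSketch {
  // Row labels: one unique document id per row, so a row can be traced back.
  final List<String> rowLabels = new ArrayList<>();
  // Column labels: term -> column index, assigned in order of first appearance.
  final Map<String, Integer> columnIndex = new LinkedHashMap<>();
  // Sparse rows: column index -> weight.
  final List<Map<Integer, Double>> rows = new ArrayList<>();

  // Adds one document given its id and its term -> frequency map.
  void addDocument(String id, Map<String, Integer> termFreqs) {
    Map<Integer, Double> row = new TreeMap<>();
    for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
      Integer col = columnIndex.get(e.getKey());
      if (col == null) {
        col = columnIndex.size(); // new term: allocate the next column
        columnIndex.put(e.getKey(), col);
      }
      row.put(col, e.getValue().doubleValue());
    }
    rowLabels.add(id);
    rows.add(row);
  }

  int numRows() {
    return rows.size();
  }

  int numCols() {
    return columnIndex.size();
  }

  // Returns the weight at (row, term), or 0.0 when the term is absent.
  double get(int row, String term) {
    Integer col = columnIndex.get(term);
    if (col == null) {
      return 0.0;
    }
    Double value = rows.get(row).get(col);
    return value == null ? 0.0 : value;
  }
}
```

For multiple text fields, one such matrix per field would correspond to the "array of matrices" option; merging the fields into a single row would need a shared column index, which this sketch does not attempt.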
