Thanks! Could you also add how to learn the weights you talked about, or at
least a hint? Learning weights for search engine query terms always sounds
like "learning to rank" to me, but that always seemed pretty complicated
and I never managed to try it out.


On Tue, May 21, 2013 at 8:01 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> Johannes,
>
> Your summary is good.
>
> I would add that the precalculated recommendations can be large enough that
> the lookup becomes more expensive.  Your point about staleness is very
> on-point.
>
>
> On Mon, May 20, 2013 at 10:15 PM, Johannes Schulte <
> johannes.schu...@gmail.com> wrote:
>
> > I think Pat is just saying that
> >
> >   time(history_lookup) (1) + time(recommendation_calculation) (2) >
> >   time(precalc_lookup) (3)
> >
> > since 1 and 3 are assumed to be served by the same class of system (key
> > value store, db) with a single key, and 2 > 0.
> >
> > Ted is using a lot of information that is available at recommendation
> > time and not fetched from somewhere ("context of delivery",
> > geolocation). The question remaining is why the recent history is
> > available without a lookup, which can only be the case if the
> > recommendation calculation is embedded in a bigger request cycle where
> > the history has already been loaded, or if it's just stored in the
> > browser.
> >
> > If you stored the classical (Netflix/Mahout) user-item history in the
> > browser and used a disk-based matrix like Lucene for the calculation,
> > you would end up in the same range.
> >
> > I think the points are more:
> >
> > 1. Having more inputs than the classical item interactions
> > (geolocation->item, search_term->item, ...) can be carried out very
> > easily with a search index storing these precalculated "association
> > rules".
> >
> > 2. Precalculation per user is heavyweight, stale and hard to do if the
> > context also plays a role (e.g. the site the user is on, because you
> > would have to have the cartesian product of recommendations prepared
> > for every user and context), while the "real time" approach can
> > handle it.
> >
> >
> >
> >
> >
> > On Tue, May 21, 2013 at 2:00 AM, Ted Dunning <ted.dunn...@gmail.com>
> > wrote:
> >
> > > Inline answers.
> > >
> > >
> > > On Mon, May 20, 2013 at 9:46 AM, Pat Ferrel <pat.fer...@gmail.com>
> > wrote:
> > >
> > > > ...
> > > > You use the user history vector as a query?
> > >
> > >
> > > The most recent suffix of the history vector.  How much is used
> > > varies by the purpose.
> > >
> > >
> > > > This will be a list of item IDs and strength-of-preference values
> > > > (maybe 1s for purchases).
> > >
> > >
> > > Just a list of item x action codes.  No strength needed.  If you have
> > > 5-point ratings, then you can have 5 actions for each item.  The
> > > weighting for each action can be learned.
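> > >
> > > As a very rough sketch of what "learned" could mean here (this is only
> > > one way to do it; the action names and the toy data are made up), you
> > > can treat one action, say purchase, as the target and regress it on
> > > per-user-item counts of the other actions.  The fitted coefficients
> > > are then the per-action weights:
> > >
> > >   # Sketch only: learn per-action weights with plain SGD logistic
> > >   # regression.  Actions and data are illustrative, not a real schema.
> > >   import math, random
> > >
> > >   ACTIONS = ["view", "cart_add", "search_click", "wishlist"]
> > >
> > >   def learn_action_weights(rows, epochs=50, lr=0.1):
> > >       """rows: list of (per-action counts, bought-or-not) pairs."""
> > >       w = [0.0] * (len(ACTIONS) + 1)          # +1 for a bias term
> > >       for _ in range(epochs):
> > >           random.shuffle(rows)
> > >           for counts, bought in rows:
> > >               x = [1.0] + counts               # bias + action counts
> > >               z = sum(wi * xi for wi, xi in zip(w, x))
> > >               p = 1.0 / (1.0 + math.exp(-z))   # P(buy | counts)
> > >               g = bought - p                   # log-likelihood gradient
> > >               w = [wi + lr * g * xi for wi, xi in zip(w, x)]
> > >       return w
> > >
> > >   # toy data: ([view, cart_add, search_click, wishlist] counts, bought?)
> > >   data = [([3, 1, 0, 0], 1), ([5, 0, 1, 0], 0),
> > >           ([1, 1, 0, 1], 1), ([2, 0, 0, 0], 0)]
> > >   print(dict(zip(["bias"] + ACTIONS, learn_action_weights(data))))
> > >
> > > The same idea scales up with any off-the-shelf logistic regression or
> > > learning-to-rank tool; the point is just that per-action weights fall
> > > out of a regression against an action you care about.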
> > >
> > >
> > > > The cooccurrence matrix has columns treated like terms and rows
> > > > treated like documents, though both are really items.
> > >
> > >
> > > Well, they are different.  The rows are fields within documents
> > > associated with an item.  Other fields include ID and other things.
> > > The contents of the field are the codes associated with the
> > > item-action pairs for each non-null column.  Usually there is only one
> > > action so this reduces to a single column per item.
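> > >
> > > To make the document shape concrete, here is a sketch of pushing such
> > > item documents to Solr over its JSON update endpoint (Python with the
> > > requests library; the host, the "items" collection and all field names
> > > are made up for illustration, and recent Solr versions accept a JSON
> > > array of documents like this):
> > >
> > >   # Sketch only: one document per item, one indicator field per action
> > >   # type.  Field contents are the sparsified cooccurring codes.
> > >   import requests
> > >
> > >   SOLR_UPDATE = "http://localhost:8983/solr/items/update?commit=true"
> > >
> > >   docs = [{
> > >       "id": "item_1234",
> > >       # items whose purchases cooccur anomalously with this item
> > >       "purchase_indicators": "item_87 item_2002 item_51",
> > >       # the same idea for other actions and cross inputs
> > >       "view_indicators": "item_87 item_660 item_3 item_912",
> > >       "search_term_indicators": "hiking boots waterproof",
> > >       "geo_indicators": "seattle portland",
> > >   }]
> > >
> > >   requests.post(SOLR_UPDATE, json=docs).raise_for_status()
> > >
> > > Each indicator field is just a whitespace-tokenized text field, so the
> > > item and action codes in a user's history can match it like ordinary
> > > query terms.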
> > >
> > >
> > >
> > >
> > > > Does Solr support weighted term lists as queries or do you have to
> > > > throw out strength-of-preference?
> > >
> > >
> > > I prefer to throw it out even though Solr would not require me to do
> > > so.  The weights that I want can be encoded in the document index in
> > > any case.
> > >
> > >
> > > > I ask because there are cases where the query will have non-'1.0'
> > > > values.  When the strength values are just 1, the vector is really
> > > > only a list of terms (item IDs).
> > > >
> > >
> > > I really don't know of any cases where this is really true.  There
> > > are actions that are categorical.  I like to separate them out or to
> > > reduce to a binary case.
> > >
> > >
> > > >
> > > > This technique seems like using a doc as a query, but you have
> > > > reduced the doc to the form of a vector of weighted terms.  I was
> > > > unaware that Solr allowed weighted term queries.  This is really
> > > > identical to using Solr for fast doc similarity queries.
> > > >
> > >
> > > It is really more like an ordinary query.  Typical recommendation
> > > queries are short since they are only recent history.
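> > >
> > > For example (a sketch with made-up IDs and field names), the query is
> > > nothing more exotic than the last few history items OR'd together
> > > against the indicator field:
> > >
> > >   # Sketch only: recent history becomes an ordinary Solr query.
> > >   import requests
> > >
> > >   recent_history = ["item_87", "item_2002", "item_51"]
> > >   params = {
> > >       "q": "purchase_indicators:(%s)" % " ".join(recent_history),
> > >       "rows": 300,              # ask for a few hundred candidates
> > >       "fl": "id,score",
> > >       "wt": "json",
> > >   }
> > >   resp = requests.get("http://localhost:8983/solr/items/select",
> > >                       params=params)
> > >   recs = [d["id"] for d in resp.json()["response"]["docs"]]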
> > >
> > >
> > > >
> > > > ...
> > > >
> > > > Seems like you'd rule out browser based storage because you need the
> > > > history to train your next model.
> > >
> > >
> > > Nothing says that we can't store data in two places according to use.
> > > Browser history is good for the part of the history that becomes the
> > > query.  Central storage is good for the mass of history that becomes
> > > input for analytics.
> > >
> > > > At least it would be in addition to a server-based storage of
> > > > history.
> > >
> > >
> > > Yes.  In addition to.
> > >
> > >
> > > > Another reason you wouldn't rely only on browser storage is that it
> > > > will occasionally be destroyed.  Users span multiple devices these
> > > > days too.
> > > >
> > >
> > > This can be dealt with using cookie resurrection techniques.  Or by
> > > letting the user destroy their copy of the history if they like.
> > >
> > > > The user history matrix will be quite a bit larger than the user
> > > > recommendation matrix, maybe an order of magnitude or two larger.
> > >
> > >
> > > I don't think so.  And it doesn't matter, since this is reduced to
> > > significant cooccurrence and that is typically quite small compared
> > > to a list of recommendations for all users.
> > >
> > > > I have 20 recs for me stored, but I've purchased 100's of items and
> > > > have viewed 1000's.
> > > >
> > >
> > > 20 recs is not sufficient.  Typically you need 300 for any given
> > > context and you need to recompute those very frequently.  If you use
> > > geo-specific recommendations, you may need thousands of
> > > recommendations to have enough geo-dispersion.  The search engine
> > > approach can handle all of that on the fly.
> > >
> > > Also, the cached recs are user x (20-300) non-zeros.  The sparsified
> > > item-item cooccurrence matrix is item x 50.  Moreover, search engines
> > > are very good at compression.  If users >> items, then item x 50 is
> > > much smaller, especially after high-quality compression (6:1 is a
> > > common compression ratio).
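> > >
> > > To put rough, made-up numbers on that: 10 million users x 300 cached
> > > recs is about 3 billion stored entries, while 1 million items x 50
> > > indicators is about 50 million, i.e. roughly 60x smaller before the
> > > index even starts compressing it.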
> > >
> > > >
> > > > Given that you have to have the entire user history vector to do
> > > > the query, and given that this is still a lookup from an even
> > > > larger matrix than the recs/user matrix, and given that you have to
> > > > do the lookup before the Solr query, it can't be faster than just
> > > > looking up pre-calculated recs.
> > >
> > >
> > > None of this applies.  There is an item x 50 sized search index.
> > > There is a recent history that is available without a lookup.  All
> > > that is required is a single Solr query, and that can handle multiple
> > > kinds of history and geo-location and user search terms all in a
> > > single step.
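> > >
> > > Concretely, that one query could look like the following sketch (the
> > > same select request as a plain history query, just with more clauses
> > > OR'd in; field names and values are made up):
> > >
> > >   # Sketch only: one Solr request mixing several kinds of evidence.
> > >   import requests
> > >
> > >   purchase_history = ["item_87", "item_2002"]
> > >   view_history = ["item_660", "item_912", "item_3"]
> > >   user_search_terms = "waterproof hiking boots"
> > >   user_city = "seattle"
> > >
> > >   q = " ".join([
> > >       "purchase_indicators:(%s)" % " ".join(purchase_history),
> > >       "view_indicators:(%s)" % " ".join(view_history),
> > >       "search_term_indicators:(%s)" % user_search_terms,
> > >       "geo_indicators:%s" % user_city,
> > >   ])
> > >
> > >   params = {"q": q, "rows": 300, "fl": "id,score", "wt": "json"}
> > >   resp = requests.get("http://localhost:8983/solr/items/select",
> > >                       params=params)
> > >   recs = [d["id"] for d in resp.json()["response"]["docs"]]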
> > >
> > >
> > >
> > > > In other words, the query to produce the query will be more
> > > > problematic than the query to produce the result, right?
> > > >
> > >
> > > Nope.  No such thing, therefore cost = 0.
> > >
> > >
> > > > Something here may be "orders of magnitude" faster, but it isn't
> > > > the total elapsed time to return recs at runtime, right?
> > >
> > >
> > > Actually, it is.  A round trip of less than 10 ms is common.
> > > Precalculation goes away.  Export of recs nearly goes away.  Currency
> > > of recommendations is much higher.
> > >
> > >
> > > > Maybe what you are saying is that the time to pre-calculate the
> > > > recs is 0 since they are calculated at runtime, but you still have
> > > > to create the cooccurrence matrix, so you still need something like
> > > > Mahout on Hadoop to produce a model, you still need to index the
> > > > model with Solr, and you still need to look up user history at
> > > > runtime.  Indexing with Solr is faster than loading a db (8 hours?
> > > > They are doing something wrong), but the query side will be slower
> > > > unless I've missed something.
> > > >
> > >
> > > I am pretty sure you have.  The customers are definitely not dopes.
> > > The problem is that precalculated recs are much, much bigger due to
> > > geo constraints.
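> > >
> > > To be clear, the offline part does not disappear: something still has
> > > to produce the sparsified cooccurrence matrix and turn each item row
> > > into a document like the ones sketched above.  Roughly (paths are
> > > placeholders and exact flags depend on the Mahout version), that step
> > > is something like "mahout itemsimilarity --input <prefs> --output
> > > <indicators> --similarityClassname SIMILARITY_LOGLIKELIHOOD
> > > --maxSimilaritiesPerItem 50", followed by a small job that formats the
> > > output for the Solr update.  What goes away is the per-user
> > > precalculation and the export of per-user recs.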
> > >
> > >
> > > >
> > > > In any case you *have* introduced a realtime rec calculation.  This
> > > > is able to use user history that may be seconds old and not yet
> > > > reflected in the training data (the cooccurrence matrix), and this
> > > > is very interesting!
> > > >
> > > > >>
> > > > >> This will scale to thousands or tens of thousands of
> > > > >> recommendations per second against 10's of millions of items.
> > > > >> The number of users doesn't matter.
> > > > >>
> > > > >
> > > >
> > > > Yes, no doubt, but the history lookup is still an issue unless I've
> > > > missed something.  The NoSQL queries will scale to tens of thousands
> > > > of recs against 10's of millions of items, but perhaps with larger,
> > > > more complex infrastructure?  Not sure how Solr scales.
> > > >
> > > > Being semi-ignorant of Solr, intuition says that it's doing
> > > > something to speed things up, like using only part of the data
> > > > somewhere to do approximations.  Have there been any performance
> > > > comparisons of, say, precision of one approach vs. the other, or do
> > > > they return identical results?
> > > >
> > > >
> > > >
> > >
> >
>
