Johannes,

Your summary is good.
I would add that the precalculated recommendations can be large enough that the lookup itself becomes more expensive. Your point about staleness is very on-point.

On Mon, May 20, 2013 at 10:15 PM, Johannes Schulte <johannes.schu...@gmail.com> wrote:

> I think Pat is just saying that
>
> time(history_lookup) (1) + time(recommendation_calculation) (2) > time(precalc_lookup) (3)
>
> since (1) and (3) are assumed to be served by the same system class (key-value store, db) with a single key, and (2) > 0.
>
> Ted is using a lot of information that is available at recommendation time and not fetched from somewhere ("context of delivery", geolocation). The question remaining is why the recent history is available without a lookup, which can only be the case if the recommendation calculation is embedded in a bigger request cycle, the history is loaded somewhere else, or it's just stored in the browser.
>
> If you would store the classical (Netflix/Mahout) user-item history in the browser and use an on-disk matrix thing like Lucene for the calculation, you would end up in the same range.
>
> I think the points are more:
>
> 1. Having more inputs than the classical item interactions (geolocation->item, search_term->item ...) can be carried out very easily with a search index storing these precalculated "association rules".
>
> 2. Precalculation per user is heavyweight, stale, and hard to do if the context also plays a role (e.g. the site the user is on, because you have to have the cartesian product of recommendations prepared for every user), while the "real time" approach can handle it.
>
> On Tue, May 21, 2013 at 2:00 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
> > Inline answers.
> >
> > On Mon, May 20, 2013 at 9:46 AM, Pat Ferrel <pat.fer...@gmail.com> wrote:
> >
> > > ...
> > > You use the user history vector as a query?
> >
> > The most recent suffix of the history vector. How much is used varies by the purpose.
> >
> > > This will be a list of item IDs and strength-of-preference values (maybe 1s for purchases).
> >
> > Just a list of item x action codes. No strength needed. If you have 5-point ratings, then you can have 5 actions for each item. The weighting for each action can be learned.
> >
> > > The cooccurrence matrix has columns treated like terms and rows treated like documents, though both are really items.
> >
> > Well, they are different. The rows are fields within documents associated with an item. Other fields include the ID and other things. The contents of the field are the codes associated with the item-action pairs for each non-null column. Usually there is only one action, so this reduces to a single column per item.
> >
> > > Does Solr support weighted term lists as queries or do you have to throw out strength-of-preference?
> >
> > I prefer to throw it out even though Solr would not require me to do so. The weights that I want can be encoded in the document index in any case.
> >
> > > I ask because there are cases where the query will have non-'1.0' values. When the strength values are just 1 the vector is really only a list of terms (item IDs).
> >
> > I really don't know of any cases where this is really true. There are actions that are categorical. I like to separate them out or reduce to a binary case.
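As a concrete illustration of the "item x action codes" idea above, here is a minimal sketch of flattening recent (item, action) history into query tokens. The token scheme ("item4321_buy"), the class name, and the truncation length are illustrative assumptions, not details of Ted's actual system; a 5-point rating would simply become one of five action codes (rate1 ... rate5).

    import java.util.ArrayList;
    import java.util.List;

    public class HistoryTokens {

        // Keep only the most recent suffix of the history; older events feed
        // offline analytics rather than the live query.
        static List<String> recentTokens(List<String[]> history, int maxEvents) {
            int from = Math.max(0, history.size() - maxEvents);
            List<String> tokens = new ArrayList<String>();
            for (String[] event : history.subList(from, history.size())) {
                String itemId = event[0];   // e.g. "item4321"
                String action = event[1];   // e.g. "view", "buy", "rate5"
                tokens.add(itemId + "_" + action);  // assumed token scheme
            }
            return tokens;
        }
    }

Note that no strength-of-preference survives this encoding; per Ted's answer above, any weighting lives in the document index, not in the query.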
> > > This technique seems like using a doc as a query, but you have reduced the doc to the form of a vector of weighted terms. I was unaware that Solr allowed weighted term queries. This is really identical to using Solr for fast doc-similarity queries.
> >
> > It is really more like an ordinary query. Typical recommendation queries are short since they are only recent history.
> >
> > > ...
> > > Seems like you'd rule out browser-based storage because you need the history to train your next model.
> >
> > Nothing says that we can't store data in two places according to use. Browser history is good for the part of the history that becomes the query. Central storage is good for the mass of history that becomes input for analytics.
> >
> > > At least it would be in addition to a server-based storage of history.
> >
> > Yes. In addition to.
> >
> > > Another reason you wouldn't rely only on browser storage is that it will occasionally be destroyed. Users span multiple devices these days too.
> >
> > This can be dealt with using cookie resurrection techniques. Or by letting the user destroy their copy of the history if they like.
> >
> > > The user history matrix will be quite a bit larger than the user recommendation matrix, maybe an order or two larger.
> >
> > I don't think so. And it doesn't matter, since this is reduced to significant cooccurrence, and that is typically quite small compared to a list of recommendations for all users.
> >
> > > I have 20 recs for me stored but I've purchased 100's of items and have viewed 1000's.
> >
> > 20 recs is not sufficient. Typically you need 300 for any given context and you need to recompute those very frequently. If you use geo-specific recommendations, you may need thousands of recommendations to have enough geo-dispersion. The search engine approach can handle all of that on the fly.
> >
> > Also, the cached recs are user x (20-300) non-zeros. The sparsified item-item cooccurrence matrix is item x 50. Moreover, search engines are very good at compression. If users >> items, then item x 50 is much smaller, especially after high-quality compression (6:1 is a common compression ratio).
> >
> > > Given that you have to have the entire user history vector to do the query, and given that this is still a lookup from an even larger matrix than the recs/user matrix, and given that you have to do the lookup before the Solr query, it can't be faster than just looking up pre-calculated recs.
> >
> > None of this applies. There is an item x 50 sized search index. There is a recent history that is available without a lookup. All that is required is a single Solr query, and that can handle multiple kinds of history and geo-location and user search terms all in a single step.
> >
> > > In other words the query to produce the query will be more problematic than the query to produce the result, right?
> >
> > Nope. No such thing, therefore cost = 0.
> >
> > > Something here may be "orders of magnitude" faster, but it isn't the total elapsed time to return recs at runtime, right?
> >
> > Actually, it is. Round trips of less than 10ms are common. Precalculation goes away. Export of recs nearly goes away. Currency of recommendations is much higher.
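To make the "single Solr query" step concrete, here is a minimal SolrJ-style sketch (Solr 4.x era) under stated assumptions: the core name, the "indicators" and "geo" field names, the token scheme, and the filter value are all hypothetical, not taken from the system discussed here.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class RecommendQuery {
        public static void main(String[] args) throws SolrServerException {
            // Hypothetical core holding one document per item, with an
            // "indicators" field of significant-cooccurrence codes (item x 50).
            HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/items");

            // Recent-history tokens (see the earlier sketch); unweighted,
            // matching the preference above to keep the query binary.
            String history = "item4321_buy item1888_view item0099_view";

            SolrQuery q = new SolrQuery("indicators:(" + history + ")");
            q.addFilterQuery("geo:us-west"); // context folded into the same query
            q.setRows(300);                  // ~300 candidates per context, as above

            QueryResponse rsp = solr.query(q);
            for (SolrDocument doc : rsp.getResults()) {
                System.out.println(doc.getFieldValue("id"));
            }
        }
    }

The point of the sketch is that recent history, geolocation, and search terms all arrive as ordinary query clauses, so one round trip (the <10ms figure above) replaces both the history lookup and the precalc lookup.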
> > > Maybe what you are saying is that the time to pre-calculate the recs is 0 since they are calculated at runtime, but you still have to create the cooccurrence matrix, so you still need something like Mahout on Hadoop to produce a model, you still need to index the model with Solr, and you still need to look up user history at runtime. Indexing with Solr is faster than loading a db (8 hours? They are doing something wrong), but the query side will be slower, unless I've missed something.
> >
> > I am pretty sure you have. The customers are definitely not dopes. The problem is that precalculated recs are much, much bigger due to geo constraints.
> >
> > > In any case you *have* introduced a realtime rec calculation. This is able to use user history that may be seconds old and not yet reflected in the training data (the cooccurrence matrix), and this is very interesting!
> > >
> > > >> This will scale to thousands or tens of thousands of recommendations per second against 10's of millions of items. The number of users doesn't matter.
> > >
> > > Yes, no doubt, but the history lookup is still an issue unless I've missed something. The NoSQL queries will scale to tens of thousands of recs against 10s of millions of items, but perhaps with larger, more complex infrastructure? Not sure how Solr scales.
> > >
> > > Being semi-ignorant of Solr, intuition says that it's doing something to speed things up, like using only part of the data somewhere to do approximations. Have there been any performance comparisons of, say, the precision of one approach vs. the other, or do they return identical results?
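A back-of-the-envelope comparison may help with the size question. The user and item counts here are purely illustrative (10M users, 1M items); only the 300-recs-per-context and item x 50 figures come from Ted's message above:

    precalc store:    10,000,000 users x 300 recs = 3.0e9 entries, per geo context
    indicator index:   1,000,000 items x 50 codes = 5.0e7 entries, total

That is roughly a 60x difference before geo-specific contexts multiply the precalculated side further, and before the ~6:1 index compression mentioned above.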